Extracting randomness from text

My BookPad cipher seems to be closely related to the running key cipher, since both take a long piece of text from a book and use it as key to encrypt a plaintext. Yet while the running key cipher can be broken easily, BookPad offers a level of security comparable to that of a one-time-pad. In this article, I try to explain why in layman’s terms. As a bonus, I introduce TripleText, a variant of BookPad where all the operations are done directly with letters.

A little spoiler: not much prevented the guy on the left from discovering this, back in the early XVI century. Had he discovered it, history might have turned out quite different.

This JavaScript version of BookPad displays the result of several randomness tests performed on the keystream right before it is used to encrypt a message, with and without using a lagged Fibonacci generator (LFG) step, or a key text shift. Big surprise: when a piece of real text is placed in the key text box, the keystream usually passes all the randomness tests even if the process of randomizing the digits by means of a LFG is skipped. This means that a user may be able to skip half the work and still get something approaching perfect security.
First, let’s try to define what “passing the tests” means. To do this, I headed to random.org, which provides strings of numbers from true random sources such as atmospheric noise, and got 1000 integers in the 0 to 999,999,999 range. I added the missing zeroes on the right of some numbers by hand in order to obtain a string of 9000 supposedly perfectly random decimal digits, which I won’t print here. Then I pasted it into the key text box of BookPad, opened up a JavaScript console, and applied several tests to this box. The tests were:

  1. Chi-squared statistic of single digits, compared to a uniform distribution (probability for each value = 1/10). Since a decimal digit string has 9 degrees of freedom (10 -1), theory predicts that a value of less than 14.7 means less than 10% probability that the stream is not random (see this table, for instance). The sample gave 11.65
  2. Chi-squared statistic of pairs of digits, again compared to a uniform distribution (probability for each pair = 1/100). Here the degrees of freedom are 9*9 = 81, so to have less than 10% probability that the stream is not random, the statistic should be less than 97.7. The sample gave 105.57
  3. Durbin-Watson statistic, which estimates the correlation between pairs of consecutive digits. If there is no correlation the value should be 2.0. Less than that, down to a minimum of 0, means that they are negatively correlated; more than that, up to a maximum of 4, means they are positively correlated. The sample gave 2.01
  4. Shannon entropy, which looks at single digits and determines how much unpredictability each one has, in bits per digit, up to a maximum of  log(10)/log(2) = 3.32193 for perfect randomness. This is related to the chi-squared statistic because it is derived strictly from single-digit frequencies. The sample gave 3.32099 bits/digit
  5. Runs test. Here we look at how often the digits switch from low (0 to 4) to high (5 to 9), or vice-versa. A perfectly random sequence, where each digits is truly independent of the previous digits, would switch every two digits on the average, but one that is not random may switch more or less often due to lingering correlations between consecutive digits. The sample gave 2.0008 digits/run

So, the “true random” sample passed all the tests (at 10% confidence, for the chi-squared tests) except the two-digit chi-squared test, but this one was close. It is likely that the sample was not long enough for this test, which is extremely sensitive, to come down to its expected value for infinitely long samples, so we’ll take a value less than 105 to be acceptable for a sample of less than 10000 digits.

So how does a piece of real test fare? I headed to gutenberg.org and got the text version of “Oliver Twist,” by Charles Dickens. I copied chapter 1 (no headings) into the plain text box of BookPad, resulting in a decimal-encoded text of length 8220 digits. I need at least three times that length in the key text box, so I copied chapter 2 (again no headings) into that box, which the JavaScript version of BookPad took as sufficiently long. Here are the statistics for the keystream generated from the key text with the default “Arsenio” encoding and no shift being used, before the lagged Fibonacci step, and after:

  • Before LFG: chi-squared = 12.476. Two-digit chi-squared = 114.75. DW statistic = 1.979. Shannon’s entropy = 3.3208 bits/digit. Runs = 1.9985 digits/run.
  • After LFG: chi-squared = 8.6155. Two-digit chi-squared = 118.93. DW statistic = 2.009. Shannon’s entropy = 3.3211 bits/digit. Runs = 2.0112 digits/run.

Both streams passed the same tests as the true random sample, missing two-digit chi-squared by a little more, but not a lot (possibly because the text sample is a little shorter). The LFG-processed keystream actually beat the “true random” sample in three tests.

In other words, they may not be truly random, but we can’t tell the difference. Even before the LFG step, which essentially doubles the amount of computation, the keystream is good enough for the purpose of encrypting a message by the one-time pad method.

I’ve performed the same test with a number of different texts (not always taken from Dickens) and in different languages: English, Spanish, German (lots of tildes and umlauts eliminated before encoding into numbers, in the last two). The result is always the same, except that oftentimes the keystream before LFG actually beats the one after the LFG. Therefore, I have revised the official description of the BookPad cipher to remove the LFG. The matching JavaScript code can be found here.

Now, how can this be possible? One of the first things you learn in spy school is to never reuse a one-time pad, because then an adversary can subtract the two ciphertexts, thus eliminating the keystream and obtaining as a result the difference between the two encoded plaintexts, which can then be attacked by statistical methods. In other words, the combination of two encoded plaintexts is expected to be sufficiently non-random to allow statistical attack. But BookPad takes the encoded key text, splits it into three parts of equal length, and simply adds them together (digit by digit, without carries) before the LFG step. So it seems that adding or subtracting two pieces of encoded text leaves statistical signatures in the result, but adding three pieces does not, if we have to believe the test results. How can this be?

Indeed, just encoding the key text does not make it random. The chi-squared statistic of Oliver Twist’s chapter 2, after encoding (29549 digits), is 5629.28 and the two-digit chi-squared is 28898.48. Durbin-Watson looks a little better at 1.9615, Shannon entropy is 3.1845 bits/digit, and runs take 2.10924 digits/run. If we split it into two halves and do a mod 10 addition of those, the chi-squared of the result is a much-lower-but-not-quite-low-enough 63.0326, the two-digit chi-squared drops to 330.27, Durbin-Watson is 2.02, Shannon entropy 3.3188 bits/digit, runs is 2.033 digits/run. If we subtract the parts rather than add them, we get statistics similar to those of the addition. That is, it is still possible to tell the result apart from random, at least through the chi-squared tests. This is the essential flaw of the running key cipher, so similar to BookPad, except that the key text is used straight, without trying to concentrate its entropy in any way. So straight encoded text, or text added or subtracted to another text retain exploitable statistics from the original texts, and yet if we involve three pieces of encoded text instead of two, the result appears to be as statistically random as a true random source. How come?

Here’s the way I would explain it. It is by no means a mathematically acceptable proof, but I think it captures the essence of what is going on:

Recall that we had decided to use a piece of key text that is three times the length of the message (after encoding) because of this calculation: 3.32193 (entropy in bits/digit of a random string of decimal digits) x 1.33 (digits/character in encoding) / 1.58 (average entropy of text in bits/character) = 2.8, so we chose 3 to be sure. The idea was that the piece of key text used would have the same entropy as a decimal random string of length equal to that of the encoded message, so we could use it to make a sort of one-time pad. Since the straddling checkerboard encoding we are using produces about 1.33 digits/character (determined experimentally), we wanted to satisfy this formula: 3.32193 x (1.33 x message length) = entropy needed = entropy supplied = message length x factor x 1.58 (average entropy of text) in order to find the factor indicating how many times longer (before or after encoding) the key text must be, relative to the plaintext.

In other words, a piece of key text three times the length of the plaintext (before or after encoding) is expected to have enough entropy to generate a keystream for that plaintext length that would be statistically indistinguishable from a true random keystream. This is what we observe. Typically,randomness extraction (this is the technical term for what we are doing) requires the use of a one-way function or hash possessing several strict properties, but here we are achieving a comparable result with a simple sum.

The LFG step in BookPad is intended to spread the effect of any single-digit changes in the sum to the complete keystream, which is one of the important properties of a hash, plus smoothing non-uniformities that might be present in the sum, and indeed it does improve the keystream considerably if the key text is too short. If we use Oliver Twist’s chapter 1 as key text (same as the plaintext), chi-squared becomes 23.058 and two-digit chi-squared is 118.38 before the LFG step, but after the LFG step they become 6.6204 and 74.450, respectively. The effect intensifies as the key text shortens, and becomes especially pronounced for keys containing 10 or fewer words (which, of course, should never be used to make a one-time pad replacement), so that the keystream after LFG passes the randomness tests even though it starts repeating after a certain point.

Going back to the result of the sum before the LFG is applied, the crucial question is whether or not the process of cutting the encoded key text into three parts and adding them preserves the entropy of the key text. Let’s say its average entropy is one-third of the random value, that is, 3.32193 / 3 = 1.1073 bits/digit (it actually a little higher, as we saw earlier). Let’s further assume for the time being that consecutive digits are independent of each other (they are not, since in the encoding used consecutive digits often derive from a single letter) so that the entropy of a group of digits is the sum of the entropies of all the digits. If we take groups of three consecutive digits, each group will have an entropy of (3.32193 / 3) x 3 bits/digit, which is the same as that of a truly random decimal digit.

So let us now make a code that assigns a single-digit value to each group of three digits. Since there are only 10 different single digits but 1000 different groups of three digits (10 x 10 x 10), this code will convert 100 different groups to the same decimal digit, which is okay because each group of three isn’t found so often anyway. Any function that does this will work: sums, subtractions, more complex calculations, codebooks; what is essential is that each resulting digit is mapped to a set of groups having equal a priori probability and there are no collisions where a group is mapped to two different digits. We can come up with many such mappings, but a simple sum of the digits in the group, mod 10, does the trick just fine. This is what is done in BookPad.

In fact, in BookPad we are not adding three consecutive key text digits, which might not be independent of each other, but digits separated by the length of the encoded plaintext. It is highly unlikely that any correlation would extend that far in a real text, so we can safely assume that the digits to be added are indeed independent. Therefore, each digit after the addition potentially has as much unpredictability as three digits before the addition, potentially resulting in full random entropy.

If instead of cutting the text into three pieces we had cut it into two, which we then added together, we’d be talking about groups of two digits, each having entropy of 2 x 1.1073 = 2.2146 bits/group. If we map this to single digits by modular addition, each digit would have 2.21 bits of entropy, which is less than truly random, and it shows in the statistical tests.

But now, and here’s the kicker, if we add or subtract to this the encoded plaintext, which has the same average entropy per digit, the resulting ciphertext has a full 3.32 bits/digit of entropy. In other words, the ciphertext is statistically indistinguishable from a random string and cannot be decrypted except by brute force. This would be possible, however, because the key text is only twice as long as the plaintext and does not contain as much entropy as a random string of plaintext length, so it is possible to tell, in principle, whether a guess at the plaintext is correct or not. If it is correct, the keystream deduced by subtracting plaintext and ciphertext will have less than perfect randomness. If a triple-length key text, later divided into three parts and added together is used, however, brute force is of no avail either, for then one would always be able to find a key for every tentative plaintext one can think of, which combined with it will yield the ciphertext at hand.

On the other hand, it is easy to prove that simply adding digits does not conserve their entropy. Let us see this with a simplified example involving binary strings made of one and zeroes. Let’s say we have a particular string with exactly twice as many zeroes as ones. The the probability of a zero is 2/3 and that of a one is 1/3. Now let us cut the string in two pieces and add them without carries, which for binary digits is the XOR operation where: 0 XOR 0 = 1 XOR 1 = 1, and o XOR 1 = 1 XOR 0 = 0. Then the probability of getting a zero at a given position in the result is 2/3 x 2/3 + 1/3 x 1/3 = 5/9, and the probability of getting a one is 2/3 x 1/3 + 1/3 x 2/3 = 4/9. That is, the likelihood of getting a one or a zero in the sum is closer to the 1/2 of the perfectly random sequence, but not quite. What’s worse, cutting this into halves and XORing them together again does not get us there, because then the probabilities of zero and one are 41/81 and 40/81 respectively. In fact, we can never get to a probability of 1/2 for either digit because the denominator of a probability thus calculated will always be a power of three, which can never be simplified to 2 no matter what the numerator might be. The original Shannon entropy was -2/3 x log2(2/3) – 1/3 x log2(1/3) = 0.9182 bits/digit, so if the entropy were conserved with the operation, we’d get over the 1 bit/digit of the perfectly random string (except that it cannot get any bigger than this). Instead we get – 5/9 x log2(5/9) – 4/9 x log2(4/9) = 0.9910, which is close to 1.0, but not quite. In other words, the entropy per digit of the sum is greater than that of the parts being added, but it can never reach the entropy of a perfectly random string.

This is because summing strings without carries is an irreversible operation, which loses entropy. It’s like thermodynamic entropy, only backwards; in thermodynamics, irreversible processes generate entropy rather than destroy it. If we wanted to conserve the information entropy of a string as it is shortened, we should have applied a reversible process such as data compression. The problem is that this takes a lot of processing, which isn’t the best thing for a paper and pencil cipher.

The encoding from letters to numbers used in BookPad does indeed involve a bit of compression. After all, one of the good features of a straddling checkerboard encoding is that it uses single digits for the most frequent characters and double digits for the rest, thus providing some compression that is bound to concentrate the entropy into fewer digits. Indeed, for perfectly random text containing 26 letters, the entropy per character is log10(26) / log10(2) = 4.70044 bits/letter, while for perfectly random decimal digits the entropy is log10(10) / log10(2) = 3.32193 bits/digit, that is, we need 4.700044 / 3.32193 = log10(26) = 1.415 times more digits than letters to represent the same entropy. Another way to obtain the same result is consider the length of a long integer expressed in decimal digits or expressed in base 26 (A to Z, for instance). If the number has N decimal digits, all of them nines, its value is 10^N – 1 which is just one unit short of 10^N. The same number expressed in base 26 will span a number M of letters so that 10^N = 26^M, yielding N/M = log10(26) as before.

But typically a text encoded with the Arsenio checkerboard or one of its key-derived variants only takes 1.33 times more digits than letters (including the space). This is because one-third of the letters in a typical text are contained in “arsenio” and are therefore converted to single decimal digits, while the other third is converted to double digits. Consequently, checkerboard encoding increases the length of a typical text by a factor of 2/3 x 1 + 1/3 x 2 = 4/3  = 1.33, which is what we observe experimentally. Therefore, the encoding compresses the text by a factor of 1.415 / 1.33 = 1.064, making the result a 6% shorter than it would have been with a non-compressing encoding.

Intuitively, it doesn’t look like this small amount of compression could be responsible for the amazingly good randomness of the triple text combination, but in order to dispel all doubts I made a simplified cipher where no encoding to digits is used. I call it “TripleText”, and you can find a JavaScript implementation of matching this article in this page (the most current version is here). Then I ran the same tests on its output.

Instead of adding and subtracting decimal digits as in BookPad, we are adding and subtracting the letters A to Z, which are base 26 digits. These operations can be performed easily with the help of a Tabula Recta:

A | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z | a
B | B C D E F G H I J K L M N O P Q R S T U V W X Y Z A | z
C | C D E F G H I J K L M N O P Q R S T U V W X Y Z A B | y
D | D E F G H I J K L M N O P Q R S T U V W X Y Z A B C | x
E | E F G H I J K L M N O P Q R S T U V W X Y Z A B C D | w
F | F G H I J K L M N O P Q R S T U V W X Y Z A B C D E | v
G | G H I J K L M N O P Q R S T U V W X Y Z A B C D E F | u
H | H I J K L M N O P Q R S T U V W X Y Z A B C D E F G | t
I | I J K L M N O P Q R S T U V W X Y Z A B C D E F G H | s
J | J K L M N O P Q R S T U V W X Y Z A B C D E F G H I | r
K | K L M N O P Q R S T U V W X Y Z A B C D E F G H I J | q
L | L M N O P Q R S T U V W X Y Z A B C D E F G H I J K | p
M | M N O P Q R S T U V W X Y Z A B C D E F G H I J K L | o
N | N O P Q R S T U V W X Y Z A B C D E F G H I J K L M | n
O | O P Q R S T U V W X Y Z A B C D E F G H I J K L M N | m
P | P Q R S T U V W X Y Z A B C D E F G H I J K L M N O | l
Q | Q R S T U V W X Y Z A B C D E F G H I J K L M N O P | k
R | R S T U V W X Y Z A B C D E F G H I J K L M N O P Q | j
S | S T U V W X Y Z A B C D E F G H I J K L M N O P Q R | i
T | T U V W X Y Z A B C D E F G H I J K L M N O P Q R S | h
U | U V W X Y Z A B C D E F G H I J K L M N O P Q R S T | g
V | V W X Y Z A B C D E F G H I J K L M N O P Q R S T U | f
W | W X Y Z A B C D E F G H I J K L M N O P Q R S T U V | e
X | X Y Z A B C D E F G H I J K L M N O P Q R S T U V W | d
Y | Y Z A B C D E F G H I J K L M N O P Q R S T U V W X | c
Z | Z A B C D E F G H I J K L M N O P Q R S T U V W X Y | b

To add two letters: find one letter on the top and the other on the left side; follow the column and row, respectively; the letter found on intersection of the two is the sum. Example: R + H = Y. To subtract two letters: find the first letter on the top and the second letter (the one being subtracted) on the right side; follow the column and row as for addition. Example: R – h = K.

The rest is very similar to BookPad. Here are the steps for encrypting:

  1. Remove all spaces, punctuation, and diacritical marks from the plaintext and the key text so only the characters A to Z remain.
  2. Measure the length of the processed plaintext and take a piece of processed key text three times as long.
  3. Split the key text piece into three parts of equal length and add them letter by letter using the Tabula Recta, which implements addition and subtraction mod 26. The result is the keystream.
  4. Now subtract the processed plaintext from the keystream using the Tabula Recta. The result is the ciphertext.

The process for decrypting a ciphertext is identical, except that the ciphertext replaces the plaintext in all steps.

If we take the 9000 true random digits that we used earlier and convert them to base 26, we obtain a string of 6360 presumably random letters (check: 9000 / 6360 = 1.415, which is the expected ratio for random letters to digits conversion). The statistical tests applied to this, slightly modified for base-26 letters rather than base-10 digits, give these results: chi-squared = 26.58113 (less than 34.4 means random with less than 10% chance of error), two-digit chi-squared = 720.0586 (less than 670.7 means random with less than 10% chance of error), Durbin-Watson = 2.0302 (should be close to 2.0), Shannon entropy = 4.69736 bits/letter (should approach 4.7 for perfectly random), runs at 1.9408 letters/run (should be close to 2.0). That is, all tests are a pass, except the two-letter chi-squared test, which does not “pass” at a less than 10% chance of error.

Does a real text cut into three parts and added together fare any worse? Should it? If the argument I made for the digits in BookPad is correct, then it seems it should, but by a narrower margin. Since the average entropy of text (spaces included, so if we take them out, the entropy should be a little higher) is measured at 1.58 bits/letter, then to get the same entropy as a random string of letters we need a piece of text that is 4.70044/1.58 = 2.975 times the length of the random string. Therefore, a triple-length string would still suffice, though not by much.

We again take the 1st chapter of “Oliver Twist” as plaintext and the 2nd chapter as key text, resulting in these statistics for the keystream: Chi-squared = 36.272,  two-letter Chi-squared = 725.51, DW statistic = 2.028, Shannon’s entropy  = 4.6949 bits/letter, runs at 1.9706 letters/run. In other words, it passes the same tests that the random text passes, and “fails” the one it fails to pass at the prescribed confidence level, with a very similar score. It is, as far as our tests can tell, indistinguishable from a truly random keystream. The cipher is not theoretically unbreakable because, even though the entropy of the key text is sufficient to guarantee this, some entropy is lost in the keystream-generation process, so the result can never be perfectly random. It does come close, though, and it only takes a piece of text that is three times as long as the random text we require, split into three parts of equal length, and added together with the help of a classic Tabula Recta.

Both the Tabula Recta (Trithemius, 1508 AD) and the running key cipher on which TripleText is based were devised back in the XVI century. Nothing prevented the cryptographers of that time from coming up with a nearly unbreakable cipher like TripleText, except perhaps scarcity of books to use as source material and lack of development of information theory (not much of it being needed, anyway). Had they invented something like this back then, history might have turned out quite different from the way it did.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.