Introducing BytePad: simple, decently secure encryption for files

Quite some time ago, I wrote an article discussing the practicability of using a file as a key for encrypting other files. As I was preparing a talk on what you can possibly do if you find your computer compromised, I thought I’d optimize the tool in that article. A couple of weeks later, the result is considerably faster and more powerful than the original. At least, enough to warrant a new article. So here it is.

The original FilePad was a sandbox where you could input a file, extract its single bits, and then operate on them in different ways in order to produce a stream of bits that would pass tests of randomness. The idea behind it was that, since the input file contained real entropy, the output stream would also contain real entropy and would be better than the output of a pseudorandom number generator, so long as it could pass randomness tests. This still would not be a true one-time pad, mind you, because that would require one full bit of entropy per keystream bit, and the initial file normally falls short of that, but it might be “decently” close to it for practical purposes. I put the word “decently” in quotes because I don’t know enough cryptography to say how close would be “sufficiently” close, or even close enough for real use. My apologies in advance. But I’ve tried to test the “decency” of the code through some attempts to break it.

The original FilePad used several simple algorithms (sketched in code after the list):

  1. The Von Neumann extractor: look at pairs of bits; if the bits are the same, throw them away and move on; if different, write down the value of the first bit and move on. This method produces statistically unbiased output from a biased source, such as a bent coin, provided the bits are independent of one another. Unfortunately, the output is, on average, at most one-fourth the length of the input.
  2. Which brings in the second algorithm: binary inflation. Take bits in groups of two and append to the stream the result of some operation between them. This extends the initial stream to double its length minus one. The second half, of course, is totally predictable given the first half. This process is similar to, but not the same as, the inflation often combined with a Von Neumann extractor, which reuses the bits the extractor discards because they are repeated.
  3. Transposition, taken from classic cryptography: write the bits into a table one row at a time, then read them off by columns. This does not change any values, but it separates bits that, because of their initial proximity, might be less than independent of each other.
  4. Lagged Fibonacci Generator (LFG). Do an operation with bits that are a certain distance apart from each other. In my implementation, I took the last bit and xored it with the first in order to get the first bit of the output, which I would write down right below the initial bit stream. The remaining bits were obtained by xoring the input and output bits immediately to the left. As in output(i+1) = input(i) xor output(i).
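
For concreteness, here is a minimal JavaScript sketch of those four steps, written against plain arrays of 0/1 values. The function names, the choice of xor as the inflation operation, and the edge handling are my own reading of the descriptions above, not FilePad’s actual code.

```javascript
// 1. Von Neumann extractor: keep the first bit of every unequal pair.
function vonNeumann(bits) {
  const out = [];
  for (let i = 0; i + 1 < bits.length; i += 2) {
    if (bits[i] !== bits[i + 1]) out.push(bits[i]);
  }
  return out;
}

// 2. Binary inflation: append one derived bit per consecutive pair,
//    extending n bits to 2n - 1 bits (xor is assumed as the operation).
function inflate(bits) {
  const out = bits.slice();
  for (let i = 0; i + 1 < bits.length; i++) out.push(bits[i] ^ bits[i + 1]);
  return out;
}

// 3. Columnar transposition: write row by row into `cols` columns, read off by columns.
function transposeBits(bits, cols) {
  const rows = Math.ceil(bits.length / cols);
  const out = [];
  for (let c = 0; c < cols; c++) {
    for (let r = 0; r < rows; r++) {
      const i = r * cols + c;
      if (i < bits.length) out.push(bits[i]);
    }
  }
  return out;
}

// 4. LFG over bits: output(0) = last input bit xor first input bit,
//    then output(i+1) = input(i) xor output(i).
function lfgBits(bits) {
  const out = [bits[bits.length - 1] ^ bits[0]];
  for (let i = 0; i + 1 < bits.length; i++) out.push(bits[i] ^ out[i]);
  return out;
}
```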

Recent testing has shown that the above LFG algorithm is quite good at improving the statistical randomness of the input bit stream, but it introduces a correlation between output bits, since each bit depends directly on the previous one. This correlation can be lessened by a transposition. The Von Neumann extractor reduces the input a lot, so much so that inflation steps, which are really a sort of duplication, cannot effectively preserve the initial length. The testing also showed operations on single bits to be inefficient in JavaScript, which is the language I used for the code.

The result of the subsequent development is BytePad, which is the original FilePad with a few key differences:

  1. The smallest binary chunk that JavaScript can process efficiently is a byte (8 bits), so operations are now done on whole bytes rather than single bits, using typed binary arrays, which makes them much faster.
  2. Because the process is much faster, the file size it can handle is also a lot larger. You can process files hundreds of megabytes in size.
  3. Operations are simplified: only LFGs and transpositions, plus xor of byte arrays. Gone are the Von Neumann extractor, which doesn’t really work with bytes, and the inflation algorithm.
  4. New ways to visualize the result.
  5. Additional tricks for enhanced security, such as Russian copulation (begging your pardon), and message authentication codes. The LFG is based on sum rather than xor (a byte-level sketch follows this list).
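
To make the byte-level building blocks concrete, here is a minimal sketch using Uint8Array, JavaScript’s typed byte array. The seeding of the sum-based LFG simply mirrors the bit version above; BytePad’s actual code may differ in such details.

```javascript
// Sum-based LFG over bytes: the byte analogue of the bit LFG, with addition
// mod 256 instead of xor (assumed seeding: out[0] = last byte + first byte).
function lfgSum(input) {
  const out = new Uint8Array(input.length);
  out[0] = (input[input.length - 1] + input[0]) & 0xff;
  for (let i = 0; i + 1 < input.length; i++) {
    out[i + 1] = (input[i] + out[i]) & 0xff;
  }
  return out;
}

// Columnar transposition over bytes, same idea as the bit version.
function transposeBytes(input, cols) {
  const rows = Math.ceil(input.length / cols);
  const out = new Uint8Array(input.length);
  let k = 0;
  for (let c = 0; c < cols; c++) {
    for (let r = 0; r < rows; r++) {
      const i = r * cols + c;
      if (i < input.length) out[k++] = input[i];
    }
  }
  return out;
}

// Xor of two byte arrays of equal length, used to combine keystream and plain file.
function xorBytes(a, b) {
  const out = new Uint8Array(a.length);
  for (let i = 0; i < a.length; i++) out[i] = a[i] ^ b[i];
  return out;
}
```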

The BytePad page itself explains the process, so I’ll only comment on a few things here, more or less in the same order as they appear on the page:

  • The key file usually will have a size that is different from that of the plain file that is to be encrypted. If smaller, it will be repeated until its combined length is greater than that of the plain file; if larger, it will be left as it is, but this may mean too much useless processing. This is why there is an option to cut the key file (after repeats, if that was necessary) to the size of the plain file before much processing takes place.
  • The picture that appears when you load a file to be used as key or when you press the Execute button is a graphical representation of the key file and of the resulting keystream file, respectively. Bytes are converted directly to green, red, and blue values for the pixels, in strict sequence, scanned horizontally from left to right (a sketch of this conversion appears after the list). It is often quite easy to tell from the image when a file is not random. Sometimes you may see a faint grid pattern that is only the result of display aliasing; resizing the page (Ctrl+ or Ctrl-) will remove this artifact. If you like the visualizer, you can use it separately from BytePad through this file. By the way, the picture at right is what a PDF of “Hamlet” looks like before being randomized. Further below you’ll see it after processing.
  • Block size can be any number, but very small values (below 16) may produce repeating patterns with key files of poor entropy. Large values may leave a poorly randomized first block that can be exploited by an enemy.
  • The initial sequence’s main purpose is to generate some pseudo-randomness when the key file has very little. It is, therefore, not necessarily a secret and this is why it has a default value. It can also be a lot longer than a single number. You can also use it as an additional key or password if kept secret between the sender and the receiver. In this case, security is further enhanced if it is used for randomizing each block, which is what the “Whitening” checkbox does.
  • The point of the “cut” (as in a deck of cards) on the key file is to be able to re-use the key file multiple times, so long as a different number is specified each time. Both sender and recipient must use the same number or decryption will fail. This number normally won’t be a secret since it needs to change for each new encryption with the same key file. It is expressed as a percentage of key file length and therefore an eavesdropper who does not know the actual length of the key file, which is secret, still won’t know the precise location of the cut even if the percentage is transmitted in plaintext.
  • There is also a cut on the plain file, and again on the final encrypted file (reversed, so the process is identical for encryption and decryption) but for a different reason. This is sometimes called . . . ahem . . . “Russian copulation”. Here we want to keep the material to be encrypted from starting with a predictable sequence of bytes, which will happen if the start of the file is a header that will be similar for other files of the same kind. This one does not need to be different for new encryptions and it is okay, even recommended, to keep it a secret between sender and recipient.
  • There is a process to generate message authentication codes (MAC) for the plain file and the decrypted files. This calls an algorithm very similar to that used for randomizing the key file (LFG, followed by transposition, and another LFG), except that we feed the plain file (after cut) instead of the key file into it, the result of the previous 32-byte block is xored rather than prepended, and the key file (after cut) is also xored before the second LFG. The result is a 32-byte block that gets reduced to 16 bytes by splitting it down the middle and xoring the halves (after transposing the first half), then converted to hex numbers for display. This code is to be transmitted in plaintext along with the encrypted file. If an enemy intercepts and alters the encrypted file so it will decrypt to something else, the MAC of the decrypted file won’t match the MAC of the original plain file. The enemy will have a tough time coming up with a MAC that works with altered contents, because the original key file, which he/she/it does not have, is involved in making it.
  • It turns out that both the LFG operation and the transposition are reversible, so I have included a set of buttons to do the process of randomizing the key file bytes in their normal forward direction, or reversed. You can, for instance, randomize a key file, which can then be downloaded from the keystream file box, load this as a key file, and de-randomize it by running the reverse process in order to recover the original key file. This might look self-defeating but I have it there to bring home the point that the fact that something looks random does not mean it is random, and an enemy would be able to recover your precious key file quite easily unless further precautions are used.
  • The main precaution will be to keep the box next to the Execute button checked. This causes the keystream file to be split in two, does a transposition on the first half, and xors it with the second (a sketch of this step appears after the list). Since the parts come from different material in the key file, they are expected to be uncorrelated for an initial file of decent entropy. When you xor them together, they are in effect encrypting one another, so that you need to have one of them in order to get the other. Below I discuss in more detail how this thwarts cryptanalysis.
  • There is a checkbox to perform a series of simple randomness tests, such as Chi-square and Shannon entropy. Doing so increases the processing time, so uncheck this box if you are processing large files and you are pretty sure the result has good randomness. The resulting keystream will be displayed as a picture in any case.
  • There are radio buttons that will apply the Deflate lossless compression algorithm, or its reverse, to the key file and the plain/cipher file, code courtesy of Imaya’s Zlib.js. Compression concentrates the entropy of the original key file into fewer bytes, thus making the result even harder to crack. Compressing the plain file makes the encrypted cipher file smaller.
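
Here is a rough sketch of the byte-to-pixel conversion the visualizer performs: three consecutive bytes become one pixel’s color channels, drawn left to right and top to bottom on an HTML canvas. The channel order and sizing are simplifications, not necessarily what BytePad does.

```javascript
// Draw a byte array as an image: each group of three bytes becomes one pixel.
function drawBytes(bytes, canvas) {
  const width = canvas.width;
  canvas.height = Math.ceil(bytes.length / (3 * width));
  const ctx = canvas.getContext('2d');
  const img = ctx.createImageData(width, canvas.height);
  for (let p = 0, b = 0; b + 2 < bytes.length; p++, b += 3) {
    img.data[4 * p] = bytes[b];          // first channel
    img.data[4 * p + 1] = bytes[b + 1];  // second channel
    img.data[4 * p + 2] = bytes[b + 2];  // third channel
    img.data[4 * p + 3] = 255;           // fully opaque
  }
  ctx.putImageData(img, 0, 0);
}
```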
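And, reusing the byte helpers from the earlier sketch, this is roughly what the final step controlled by the checkbox next to the Execute button looks like; the column count for the transposition is an assumption.

```javascript
// Split the keystream in two, transpose the first half, xor it into the second.
// The result is half the length of the input, and neither half can be recovered
// from it without the other.
function foldKeystream(keystream, cols) {
  const half = Math.floor(keystream.length / 2);
  const first = transposeBytes(keystream.subarray(0, half), cols);
  const second = keystream.subarray(half, 2 * half);
  return xorBytes(first, second);
}
```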

The app has Save buttons for two files: (1) the randomized key file that was used for encryption (cut to the same size as the plain file if so selected with the radio buttons near the top), and (2) the encrypted file, which would be the decrypted plain file if the encrypted file is loaded into the plain file box. Statistics are displayed only for the randomized key file, which is the one file for which these may be important. These statistics are very simple, so if you want to really test how good the randomization is, you can randomize a relatively large file (12 MB may be enough) and feed its randomized form into a testing suite such as Diehard (download link for Windows). The randomized files I generated also passed every one of the Diehard tests, as well as the NIST-STS 1.6 tests, so the randomized key files are expected to provide very good security when xoring a plain file into them.
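
As a point of reference, the kind of quick statistic reported by the test checkbox can be sketched in a few lines: Shannon entropy over the byte histogram, in bits per byte, where 8.0 is the ceiling for a perfectly uniform file. This is only an illustration of the idea, not BytePad’s exact test code.

```javascript
// Order-0 Shannon entropy of a byte array, in bits per byte (maximum 8.0).
function shannonEntropy(bytes) {
  const counts = new Array(256).fill(0);
  for (const b of bytes) counts[b]++;
  let h = 0;
  for (const c of counts) {
    if (c === 0) continue;
    const p = c / bytes.length;
    h -= p * Math.log2(p);
  }
  return h;
}
```

Order-0 entropy ignores correlations between bytes, which is why a suite like Diehard or NIST-STS is needed for serious testing.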

But, as I said above, that security may be only apparent. I think it goes without saying that reusing a key file without changing the “cut” value loses all security. This encryption method is a stream cipher at heart, which is only secure if the keystream is different for each new message. Doing a cut early in the process ensures that this will happen (unless the key file has very poor entropy). It is okay to transmit the value of the cut, which can be a decimal, quite in the open. Think of it as a sort of message serial number. If you forget to do this, though, the app includes a trick that will do a similar thing automatically, so long as you encrypt files of different length. It is described below.

If you uncheck the Non-reversible box, the randomized key file is perfectly reversible back to its plain form (and I even supply the algorithms to do it). This is what an enemy would do in a “known plaintext attack”. He/she/it only has to xor the plaintext out of the ciphertext, and the keystream is open to be inverted back to the plain key file. Then the enemy knows my source and can decrypt everything I encrypt with that file, even after cuts and whatnot. He/she/it may even guess the source of my key files, and then I’m really smoked.

Even with the Non-reversible box checked, it is not obvious that I will have complete security, because of the linearity of the algorithms employed, which makes them transitive with other operations in some cases. The transpose algorithm, for instance, is transitive with the xor operation, which means that given two byte streams A and B of the same length, transpose(A xor B) = transpose(A) xor transpose(B). The LFG function based on xor is also transitive with the xor operation, but the LFG based on sum is not, and this is one of the reasons why the LFG used in BytePad is based on sum rather than xor. The other reason is that the period of the resulting byte stream, in the absence of input entropy, is a lot longer for small block lengths. But suppose we had used a transitive LFG. Then, when we split the keystream in two and xor the two halves A and B together without further operations, we’d get A xor B = f(a) xor f(b), where f() stands for the succession of first LFG, transpose, and second LFG applied to the plain key file to generate the keystream, and a and b are the respective portions of the key file. If we now apply a reverse LFG, reverse transpose, and reverse LFG in succession to the supposedly non-reversible keystream, since they are all transitive with xor, we get:

reverseLFG(reverseTranspose(reverseLFG(A xor B))) = f⁻¹(A xor B) = f⁻¹(A) xor f⁻¹(B) = a xor b

And the result is the combination of two plain texts, which can be attacked by frequency analysis and other traditional techniques in order to retrieve the original a and b. This is why more tricks are added in order to enhance the non-reversibility of the keystream (which an enemy can always obtain through a plaintext attack), in addition to using the sum version of LFG, which is not transitive with xor. Here they are:

  1. Split the key file into blocks to be randomized, and feed the result of the previous randomization into the next. Then the resulting blocks, before the final xor operation, would be: f(a1), f(A1 xor a2), f(A2 xor a3), and so forth, for an initial block sequence a1, a2, a3…. If f() is transitive with xor, this can still be attacked after splitting in two and xoring the halves, since then the resulting blocks would be the invertible result of xoring four plain file blocks, which can be reduced to the xor of three blocks by further combination with xor (it would be too long to show it here). Three pieces of plain file xored together can still be attacked by frequency analysis and other techniques, although it is quite a bit harder, as discussed in this other post relative to combining three pieces of English text. This is why the previous randomized block is not xored in at the start of the randomization of the next block, but after the first LFG (a sketch of this chaining follows the list), so the closest an attacker can get to the plain file is the xor of a couple of plain blocks with the reverse LFG of a third, which is much more random-looking than a plain block, making traditional cryptanalysis very difficult.
  2. Take bytes starting from the end of the key file (after cut) rather than the beginning. This way, when a key file chunk containing just the right number of bytes is taken in order to encrypt a plain file, this chunk will start at different locations for plain files of different lengths, ensuring that the first block to be randomized is different, and thus all that follow. In case the user forgets to use a different cut percentage for different encryptions, the app does a similar thing automatically.
  3. Do an operation on one of the halves before xoring it with the other half. Even if that operation is transitive with xor, an attacker would end up with the xor of plain text and something that has been processed through an LFG or a transposition, or their inverses, which is harder to attack than a xor of plain texts. In the app, a transposition is done on the first half (the whole thing, not by blocks) before xoring it with the second half, because the last previous operation on either half is an LFG.
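
The transitivity claims above are easy to check with the byte helpers from the earlier sketch: transposition commutes with byte-wise xor because it only permutes positions, while the sum-based LFG does not, because its additions carry between bit positions.

```javascript
// Two random byte arrays to test with.
const A = Uint8Array.from({ length: 24 }, () => Math.floor(Math.random() * 256));
const B = Uint8Array.from({ length: 24 }, () => Math.floor(Math.random() * 256));

// Transposition is transitive with xor: these two are always equal.
const t1 = transposeBytes(xorBytes(A, B), 6);
const t2 = xorBytes(transposeBytes(A, 6), transposeBytes(B, 6));
console.log(t1.every((v, i) => v === t2[i])); // true

// The sum-based LFG is not: these two are almost never equal.
const s1 = lfgSum(xorBytes(A, B));
const s2 = xorBytes(lfgSum(A), lfgSum(B));
console.log(s1.every((v, i) => v === s2[i])); // false, except by rare coincidence
```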
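And here is a rough sketch of the chaining described in trick 1, again using the earlier helpers. The per-block pipeline (LFG, transposition, LFG) and the point at which the previous block is xored in follow the description above; seeding, padding, and other details are my assumptions, not BytePad’s actual code.

```javascript
// Rough sketch of the chained per-block randomization described in trick 1.
// `seedBlock` stands in for whatever the initial sequence contributes and is
// assumed to be `blockSize` bytes long.
function randomizeKey(keyBytes, blockSize, cols, seedBlock) {
  const out = new Uint8Array(keyBytes.length);
  let prev = seedBlock;
  for (let start = 0; start < keyBytes.length; start += blockSize) {
    const block = keyBytes.subarray(start, Math.min(start + blockSize, keyBytes.length));
    let r = lfgSum(block);                        // first LFG
    r = xorBytes(r, prev.subarray(0, r.length));  // chain in the previous block here,
                                                  // not before the first LFG
    r = transposeBytes(r, cols);                  // transposition
    r = lfgSum(r);                                // second LFG
    out.set(r, start);
    prev = r;
  }
  return out;
}
```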

One way an attacker might be able to break BytePad encryption is if there is a way to tell when a brute-force attempt is “getting closer” to the solution in some way. This is what we have seen in movies countless times: the secret key is found one piece at a time by a smart algorithm that somehow knows when each piece is correct. So I took “Hamlet” and encrypted it with itself as key file (why waste good literature?), with the Non-reversible box checked and default parameters in all the boxes. Then I attempted to decrypt the result with one of the parameters slightly different. Result: decryption failed completely, yielding files that looked fully random, if even one bit was different in the initial sequence or the key file (except for files of very low entropy), or if one of the cuts took place one byte before or after. It seems, therefore, that you can be just one bit away from having the correct key file and other parameters and not have a clue that this is the case. So much for the Hollywood method.

I want to finish this post with one more thought about security. BytePad manages to turn a file made entirely of zeros, and therefore containing no entropy whatsoever, into something looking as random as the second picture of Hamlet. This is because the LFG algorithm by itself is a pretty good pseudorandom number generator. It is not in common use because the quality of the resulting keystream tends to vary unpredictably with the seed (which defaults to a single “1” byte in our case). This is not a problem here because the strength of the algorithm is not based on the generating quality of the LFG by itself, starting from a relatively small seed, but on its ability to mix entropy-containing bytes together so the result has better statistical randomness. The core process does not add or remove entropy, since it is reversible. By the way, have you stopped to think that pretty much every encryption algorithm out there, including the standard AES and other famous ones in which we trust our digital lives every day, starts from a very short key, typically at most 256 bits (32 bytes) long? This means that the keystream they generate (many are actually block ciphers, but you get the idea) only contains that much entropy altogether. While BytePad goes through a few steps in order to randomize the key file, most ciphers involve at least ten rounds of a complex permutation-substitution operation. They get their strength partly from requiring so much computation to get the key back from the result that it would actually take less computation to test every possible key. Rather than adding more computation for each new block, BytePad simply adds more real entropy from the key file.

How much? This depends on the type of file. You get a rough idea when you load the key file into BytePad and see its graphic version: the more random-looking, the higher the entropy. As I discussed in this other post, English text contains an average entropy of 1.56 bits per character. Each character occupies two bytes, or 16 bits, when we load the file into BytePad, so we end up with 1.56 / 16 = 0.0975 bits of entropy per bit of file, or about 10.25 times less entropy than a truly random file. But this is a lot compared to the entropy in a conventional encryption method, which is at most 32 bytes spread over however many bytes are generated. With a mere 1 kB of keystream you are already putting in 32 times less entropy than random. 256 bits is still considered very secure for conventional encryption, so using a text file as a key, although 10.25 times less “secure” than a truly random one-time pad (10.25 / 2 = 5.12 times less if doing the final xor as you should, because you then use a double-size chunk of key file), has the potential to be much, much more secure than a conventional encryption method, provided the process is done correctly, so that the original key material is obfuscated to the point that traditional cryptanalysis cannot find a chink to pry at.

Now, if you have a chimp, that’s a whole other matter, especially a smart one like the one in the top picture. At the very least, he’ll crank out perfectly random versions of “Hamlet”, one after the other. The rest of us will have to use BytePad.
