Cryptanalyzing FibonaRNG

Sorry about the title. This post is motivated by Steven’s comments to the “What is Randomness?” post, where he describes a way that the current paper-and-pencil cipher champion, FibonaRNG, could be broken. Rather than responding with more comments, I thought a whole new post on the issue would make more sense, since it’s going to be rather long. For those who prefer the short version: yes, what Steven says would work, but not very well, although it looks like it should. Read on if you prefer the long version.

Let me begin by inserting what Steven (didn’t get a last name) said on the other post, with minor edits:

While yes, the continuation of the calculation is dependent on the characters in between the two characters and the third generated character, that’s not what I meant by a repeated state. While a short periodic repeat is essentially death for a PRNG, having a deterministic outcome of one part of the stream based on previous parts of the stream is also death for a PRNG. Any repeated pair results in a deterministically repeated output, even if there is stuff between the pair and the output.

Because of the birthday paradox math, it actually wouldn’t take more than about 30 characters of keystream to find a repeated pair. In your example, the pair WD occurs twice. If the message were just three characters longer, there would be a predictable V right there in the keystream, and that’s without constructing a tabula recta. And if a message were say, 500 characters long, then enough information from the plaintext would leak into the ciphertext through this property that we could likely determine the seed length from the ciphertext alone (assuming we know the language of the message). I could attack it essentially like I would attack a broken Vigenère cipher.

So even if the keystream appeared to be good, with no periodic repeats, there is still the fundamental property that it is deterministic from previous outputs, regardless of your seed choice.

The way I interpret this is the following: The lagged Fibonacci generator (LNG) step combines two consecutive letters from the keystream to generate a third, which ends up located a distance after the first letter equal to the seed length. Now, the LNG passes independence tests for any two keystream letters separated by any distance, but will it pass independence tests for groups of three letters? It doesn’t seem like it should, since any pair of consecutive letters forces a particular third letter a little distance away due to the way the keystream is being generated, and therefore those three letters are not independent. In other words, it should not be true that P(x,y,z) = P(x)*P(y)*P(z) for certain combinations of x, y, and z, where the function P(something) denotes the probability of encountering that something in the keystream.

One way to start breaking the cipher, therefore, would be this:

  1. Run a battery of tests of independence for groups of three letters, the first two next to each other, the third some distance down from them, which you change from test to test.
  2. When that distance is equal to the seed length minus one, the test should fail (and possibly also for the multiples of the seed length, minus one), but likely not for any other distance. This will tell us what the seed length is.
  3. Now that we know the seed length, cryptanalysis has begun. Steven would do something similar to attacking the Vigenère cipher, which essentially consists on getting a histogram on each group containing  all the letters separated by the seed distance; since in the Vigenère cipher those would have been encrypted with the same offset on a Tabula Recta, the original statistics of the language would still be present (as in “E” is most frequent, and so forth), which would quickly lead to the repeating key, and a complete solution. I don’t think this is going to work for FibonaRNG, where cipher letters separated by any regular distance do not use the same keystream letter, but I’m sure greater intelligences than mine will be able to find some prying leverage here.

In order to test this, I wrote a little JavaScript function that I can run as soon as the ciphertext is generated, using a variation of the FibonaRNG demo page that you can find here. This is the function:

function corrAtDistance3(array,base,distance1,distance2){
 var length = array.length,
     highIndex = length - distance1 - distance2,
     result = 0,
     operand,
     expected,
     freqTable = new Array(base);
 for(var i = 0; i < base; i++){
   freqTable[i] = new Array(base);
   for(var j = 0; j < base; j++) freqTable[i][j] = new Array(base).fill(0)
 }
 for(var k = 0; k < highIndex; k++){             //fill the table with data
   freqTable[array[k]][array[k + distance1]][array[k + distance1 + distance2]]++
 }
 for(var i = 0; i < base; i++){                  //each first character
   for(var j = 0; j < base; j++){                //each second character
     for(var k = 0; k < base; k++){              //each third character
       expected = freqArray[i] * freqArray[j] * freqArray[k]/length/length;         //expected P(x,y,z) = P(x)*P(y)*P(z)
       if(expected > 0){                         //in case a letter does not appear at all
         operand = freqTable[i][j][k] - expected;
         result += (operand * operand) / expected
       }
     }
   }
 }
 return result
}

Where “array” is the keystream or whatever, containing numbers from 0 to 25, “base” is 26, “distance1” is going to be 1, “distance2” is going to be variable, and “freqArray” is the histogram containing single-letter frequencies, previously calculated by a separate function. The function performs a Chi-square test on the hypothesis that P(x,y,z) = P(x)*P(y)*P(z), where x, y, and z are items separated by the given distances within the array. With the parameters given, the test passes with 10% probability of error for a value of 18000.

And now some tests. The function works by making a 26x26x26 array to count the frequencies of letter triads. There are 26^3 = 17576 such triads, so in order to get decent statistics we need a large chunk of text, at least 10 times that value. I fed it the first 10 chapters of “Oliver Twist,” gutenberg.org version, which produces a keystream containing 115110 characters. Key material was: Key 1 = “marvelous”, Key 2 = “wonderful”, Key 3 = ‘N’ (not used), Seed = “awesome” (length = 7). Result: the function corrAtDistance3(stream,26,1,6) returns a value of 2897569, which is a solid fail of the test applied on the keystream, whereas the returned value is 17923 (pass) when the last parameter is 5, and 17918 (also pass) when the last parameter is 7. This is precisely as expected, and reveals the seed length quite well.

Ah, but his was if the test is applied to the keystream, which is not the same as the ciphertext. A reasonable expectation (I believe Steven would agree with this) is that such strong correlation within keystream triads would still remain in some way after the keystream is combined with the plaintext in order to produce the ciphertext, since the plaintext is itself far from random. Here are the numbers resulting when the same test is applied to the ciphertext: when the second distance parameter is 5 (seed length -2), the test returns 17455 (pass), when it is 6 it returns 18334 (fail, but not by much), when it is 7 it returns 17336 (pass). In other words, the correlation is still there, but it is barely detectable after the keystream combines with the plaintext. This was with a particularly “bad” initial random seed; in a majority of cases the result returned for all distances is below 18000. For shorter texts, the probability of a pass increases and it is all but assured for less than 65,000 characters (5 chapters of “Oliver Twist”). And who encrypts messages this long just by hand?

In other words: it should work in theory, but it doesn’t in practice because we won’t have the keystream even if we do have the plaintext. Separating the two is likely equivalent (got to check this) to finding the two permutations of 26 elements involved in combining plaintext and keystream, which is a harder problem than finding the seed length.

So perhaps it would have worked if there hadn’t been any substitutions (or possibly just the second, controlled by Key 1). It turns out, however, that we get the same results if we use straight alphabets on the Tabula Recta. It seems that the triad correlation is so weak by itself that just adding common text manages to dissolve it down to undetectable levels.

If we apply the same test to the plaintext, we obtain values for the first 10 chapters of Oliver Twist that are around 90000, decreasing slowly as the distance between the first two letters and the third increases, since the result is mostly controlled by the strong correlation between the first two letters. The plaintext fails the test of independence, as it should, but not as badly as the keystream, and it does not exhibit the same sharp peak of dependence when the distance between the first and third letters involved in the test is equal to the seed length. After this is combined with the keystream using the Tabula Recta, their very different patterns of dependence seem to fight, rather than reinforce, each other with the result that the ciphertext does not longer show a detectable dependence, at least for messages not inconceivably long.

Does this mean that FibonaRNG cannot be broken even by computer analysis? By no means, but certainly we must be cleverer next time.

2 thoughts to “Cryptanalyzing FibonaRNG”

  1. That’s a nice analysis man! I totally agree on a lot of it! And yes, it would be very difficult get the information out of the cipher text, however, the math I would use would be a bit more complicated.

    I would to a triad analysis on the expected language and use the data from that to combine with the triad analysis on the cipher text (frequency of “similar” letters being those distances from each other combined with value ranges, etc. techincal stuff). This is why I specified knowing the language would help the best.

    However, this is a wonderful analysis! Great work man! I look forward to more!

  2. Could FibonaRNG be used for the purposes of helping obfuscate a locally stored BIP39 seed phrase for a crypto wallet? A BIP39 seed phrase is a set of 24 words [1] which can be used to represent a randomly produced 256-bit key. It is used by various crypto wallets, such as Ledger and Trezor products, and is widely supported by other wallets.

    These words are used to derive the root key for the wallet and from there all subsequent deterministic wallet addresses. Thus the set of words, in the order given, are critical because they are the only consistent means a person has to be in control of their wallets.

    The person has to write down and store these 24 words, typically on supplied cards. The idea is that this prevents (or limits) them from being displayed on a computer screen, they have not been typed, they have not been photographed or screenshotted. By keeping them on a card the risk and footprint is tightly managed at that point. People can then expand that risk as they see fit, and this is where it gets messy.

    How do we keep this top secret word list private? There are many bad ideas, based on differing opinions of risk and threat scenarios, as well as unfortunately a whole heap of uninformed opinion, half understood technical ideas, dogma and ego and ‘received wisdom’, especially in many forums. This has always been the way with online crypto discussions, I have found since the 90’s.

    Users will often buy one of these steel wallets. Why? It can help protect against destruction of the paper by fire, but it does not protect against, say, theft, and perhaps increases the risk of opportunistic theft. It also introduces new risks which are not present with paper.

    People will store their phrases in a password manager. Is that okay? Maybe, more likely maybe not. It expands the fence around the original small paper risk to include the phrase once again going back onto a computer via a keyboard and secured in some other way. It adds a dependency on software. It probably introduces an off-site element to the risk. If the person is technically clued up then they may be happy with this risk expansion, versus the risk of securing a piece of paper from theft or destruction long-term.

    Some people will take a photograph. Now it is on a smartphone, scattered around its filesystem and likely being processed on a cloud storage service and having content analysis on it. The fence around the risk has become very wide and messy with a lot more weak potential.

    Then we have vendors going out of business, software EOLing and so on. Despite all this, people can still feel very unhappy about having to babysit a piece of paper or a piece of metal.

    Out of interest I have been looking into ways to obfuscate the phrase so that it can be stored locally, managing the above risks as before, while adding some complexity to its recovery. This complexity would help grant some time to react where needed so that funds can be moved to new secure wallets.

    Starting with a set of 24 BIP39 words from [1] below, and on the assumption that that point represents a known good safe checkpoint from which to start:

    What it needs to do:
    1. Permit those words to be stored offline and not rely on any scripts or software to recover them
    2. Be able to operate using pen and paper, eg something like your FibonaRNG
    3. Prevent casual disclosure of the seed phrase, which is what happens immediately if someone takes the paper or metal solution, by definition since that is the purpose of the paper or metal
    4. Allow the storage to still be done simply using those kinds of items, eg written or engraved groups of characters
    5. Allow a phrase(s) or keyword(s) to be used with the encoded text and knowledge of the approach used to recover the original words
    6. Require someone who does not know these things to have to expend some resource to analyse and attack it and reverse engineer it.
    7. Avoid security theatre and pointless steps which will be forgotten – the complexity essentially “burying the wallet in the desert without a map”, as someone nicely described it.

    What it ideally should do:
    8. Have some resistance to a known plaintext attack, given that the full set of words are known in [1] below.
    9. Perhaps take advantage of the relatively small plaintext, eg approx 96 characters is all we have to store or attack
    10. Fractionate and smooth out letter frequency analysis

    What it is not doing (inspired by the kind of comments this sort of discussion always attracts):
    11. Acting as a long-term genius DIY never-seen-before crypto solution to thwart James Bond and state-level attackers.
    12. Replacing Shamir Sharing or BIP39 passphrases or BIP38 local encryption or anything else offered. They have their own use-cases and their own risk profiles and they may or may not intersect with this seed phrase clear text storage problem, and in different ways, on a per-person basis depending on their risk profile, threats, specific wallet requirements and capabilities, technical skill, etc.
    13. Replacing the need to physically store the encrypted paper or metal in some secure manner. It still needs to be managed the same way and that always was and is its primary means of being kept safe offline.
    14. Preventing loss from damage or destruction. That’s not its job.

    Summary

    The recovery seed phrase is just another piece of information which represents value and risk and must be managed. As long as the limits and risks are understood then it is perfectly acceptable to use an approach which adds some work to the recovery process. In the above use case I want to stay as loyal to the original recovery paper as possible. So keep everything offline and in the same domain as the original pen and paper which prevents casual disclosure and which adds a time element to discover the process and perform the work to recover the phrase.

    Whatever we can say about such an approach it is clear that it is at least as strong as storing the seed phrase in the clear.

    It strikes me that FibonaRNG could be a good approach for this. It is simple enough to reproduce and with a small enough quantity of text to add friction to cryptanalysis. Certainly more friction than simply looking at the plain text written out with no work needed at all. But for the legitimate owner it’s relatively simple to get back to the 24 word seed phrase.

    I would be interested to hear the blog author’s thoughts on this problem, perhaps as a new blog post.

    Thanks for your blog and articles and interest in pen and paper ciphers over the years.
    Chris

    References

    [1] BIP39 English word list (first 4 letters are unique in each word) – https://github.com/bitcoin/bips/blob/master/bip-0039/english.txt

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.