You’ve heard it before: “This is as likely as a monkey sitting on a typewriter writing Shakespeare.” It sounds very unlikely but . . . how unlikely, exactly? In this article, I go through the math and use the result to estimate how likely it is for life to have arisen spontaneously out of a primordial soup of chemicals.
I won’t bore you with how this whole trope of monkeys and typewriters came about. It has been around for quite a long time (that’s why typewriters, not laptops, are mentioned) and Wikipedia has a couple of excellent articles on it. I’d start with this one. Succinctly, the most popular version says that an infinite number of monkeys typing randomly for an infinite time can produce any work of literature, because the probability of getting it right, even if small, is not zero, and they have unlimited time. The original version, attributed to Émile Borel, referred to the probability of a substance departing, even momentarily, from its most probable thermodynamic macrostate. Borel described it in terms of thousands of monkeys typing complete libraries in his 1913 essay “Mécanique Statistique et Irréversibilité”.
But again, can we get a number, please? Here’s my shot at calculating the probability of our simian friend typing the Gutenberg.org version of “Hamlet.”
After removal of its header and the legal gobbledygook at the end, the Gutenberg.org text version of “Hamlet” contains 144048 non-space characters, or 172957 including spaces, in 3266 paragraphs spanning 4690 lines. The characters include all 26 letters, lowercase and uppercase, plus these 11 punctuation signs: !()-;:'”,.? plus spaces and newlines (counted separately below). The monkey will have to type the spaces, as well as a carriage return at the end of each line, for a total of 172957 + 4690 = 177647 keystrokes. We can discount 384 underscores, which are unnecessary. Brackets are assumed to be identical to parentheses. There is one Latin diphthong, to be split as two letters. No numerals. Thus, the total character count is 177647 – 384 = 177263. The number of different characters, including spaces and linefeeds, is 26×2 + 11 + 2 = 65, but this is not going to matter for our calculation.
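If you’d like to check these counts yourself, here is a rough Python sketch of the tally. The filename and the exact cleanup rules are my assumptions, and Gutenberg revises its files from time to time, so your totals may differ slightly from the ones quoted above.

    # Rough keystroke tally for a locally saved, header-stripped copy of Hamlet.
    # "hamlet.txt" is a hypothetical filename; adjust the cleanup rules to taste.
    with open("hamlet.txt", encoding="utf-8") as f:
        text = f.read()

    lines = text.splitlines()
    non_space = sum(1 for c in text if not c.isspace())   # letters, punctuation, underscores
    with_spaces = non_space + text.count(" ")             # add the spaces
    underscores = text.count("_")                         # decorative, to be discarded
    keystrokes = with_spaces + len(lines) - underscores   # one carriage return per line
    print(non_space, with_spaces, len(lines), underscores, keystrokes)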
My American English keyboard has 46 main keys, plus 46 more after shifting, plus space and carriage return, for a total of 94, covering all the characters mentioned above plus some that don’t appear in Hamlet. Every time the monkey presses a key, it has to be the right one, so the probability of “Hamlet” being the result is (1/94)^177263.
My calculator overflows trying to compute this number, so here’s a trick. I can multiply and divide by 100, so the number becomes (100/94)^177263 * 100^(-177263). Unfortunately, the first factor is still indigestible to the calculator, so we need to split it, this way:
((100/94)^1000)^177 * (100/94)^263 * 100^(-177263)
= (7.44983*10^26)^177 * 1.16781622*10^7 * 10^(-354526)
= 7.44983^177 * 10^4602 * 1.16781622*10^7 * 10^(-354526)
= 2.3437224*10^154 * 1.16781622 * 10^(4602 + 7 – 354526)
= 2.73708 * 10^(-349763)
Another way to get the number is through the use of logarithms. The decimal logarithm of the number we’re looking for would be – 177263 * log10(94) = – 177263 * 1.973128 = – 349762.562713. Raising 10 to this exponent we obtain 10^(1 – 0.562713 – 349763) = 10^0.437287 * 10^(-349763) = 2.73708 * 10^(-349763), same as before.
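If you’d rather let a computer keep track of the logarithms, a few lines of Python reproduce the same number (the key and keystroke counts are the ones derived above; swap in your own if they differ):

    import math

    KEYS = 94             # keys on my keyboard, shifted and unshifted, plus space and return
    KEYSTROKES = 177263   # keystrokes needed to type Hamlet, as counted above

    log10_p = -KEYSTROKES * math.log10(KEYS)   # log10 of the probability
    exponent = math.floor(log10_p)             # integer part of the exponent
    mantissa = 10 ** (log10_p - exponent)      # leading digits
    print(f"{mantissa:.5f} * 10^{exponent}")   # 2.73708 * 10^-349763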
In linear form, this would be zero, point, then 349762 zeros, and then 273708. Quite a bit to write. If we assume that one page of printed numbers will have roughly 1250 characters (250 words * 5 characters per word), it would take 280 pages to write it, all but the last filled with zeros, making a book longer than “Hamlet” itself.
But maybe we can get a clean copy faster with a whole bunch of monkeys teaming up to produce it. Assuming we have a monkey sitting on every atom in the universe (10^80 of them according to recent estimates), each cranking out a full copy one million times per second (these monkeys are pretty fast typists, thanks to an endless supply of Coca-Cola), it would take 1/2.737 * 10^(349763 – 80 – 6) = 3.654 * 10^349676 seconds, or roughly 1.16 * 10^349669 years, which is about 8.4 * 10^349658 times the age of the universe (currently estimated at 13.77 billion years). Sorry, teamwork. I don’t think anyone, not even the Coca-Cola company, would invest in this venture.
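Here is the same back-of-the-envelope arithmetic in Python, so you can plug in your own monkey count and typing speed (the 10^80 atoms and the one-microsecond copy time are the assumptions from the paragraph above):

    import math

    log10_p = -177263 * math.log10(94)   # log10 probability of one perfect copy
    LOG10_MONKEYS = 80                   # one monkey per atom in the universe
    LOG10_COPIES_PER_SEC = 6             # copies per monkey per second

    log10_seconds = -log10_p - LOG10_MONKEYS - LOG10_COPIES_PER_SEC
    log10_years = log10_seconds - math.log10(365.25 * 24 * 3600)
    log10_ages = log10_years - math.log10(13.77e9)    # ages of the universe
    print(log10_seconds, log10_years, log10_ages)     # ~349676.56, ~349669.06, ~349658.92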
So, yes, this is a very small probability. But it is still rather large compared to the probability of a tepid cup of coffee getting hotter at the spot where you drink it (at the expense of getting cooler elsewhere) for just one microsecond, which is what Borel wanted to illustrate: not strictly impossible, but about as likely as a whole army of monkeys on typewriters cranking out all the major works of Western literature without a single error.
Now that we have a number and a process to arrive at it, maybe we can apply this to some other situation of interest. One of them is the origin of life on earth (or wherever, if life came here riding an asteroid, as some have suggested). Granted that Darwinian evolution can cause living beings to differentiate and become more sophisticated over time, there is still the problem that the first living organism, which cannot possibly have evolved from a previous non-living entity because evolution requires life, seems to have been itself quite complex (note: there’s speculation about “evolution” of molecules in the non-living prebiotic world, but this is not Darwinian evolution). Ongoing research on the subject has posited the existence of a single-cell progenitor of all life currently here, named Last Universal Common Ancestor, LUCA for short. Now, this may or may not be the first cell. It seems, however, that the lineage of all other contemporary competitors of LUCA has become extinct. The estimate is that LUCA’s genome contained at least 355 genes, because this is how many different proteins seem to have been present in its tiny body, all of them having survived, more or less mutated, to our day. It seems that LUCA had DNA (could have been RNA, but this makes little difference) hundreds of thousands of base pairs long.
Perhaps a better estimate of the first living cell’s complexity can be obtained by removing genetic material from a modern cell and seeing what we can get away with before the cell, or its lineage, is no longer alive. Of course, it is not likely that we would chance into anything resembling the historical first living cell by following this method (among other things because the earth’s environment back then was very different from what today is an environment friendly to life), but maybe we’ll get a good estimate of its complexity, since the first living cell had to perform pretty much the same functions in order to stay alive and pass on the trick to its descendants. There’s an excellent Wikipedia article on Minimal Genome that discusses the history of this effort, which has been going on for several decades. The current minimum-genome champion, a synthetic cell named JCVI-syn3A, has a genome consisting of 543 kbp in 493 genes. The logical unit here is the “base pair” (bp for short), which is equivalent to 2 binary bits because the base-pair language is base-4. Here our monkey would have a typewriter with only four keys: A, G, C, T, corresponding to the four bases. The probability of getting the syn3A genome right in a single typing session is therefore 1/4^543000. The decimal logarithm of this probability is -log10(4) * 543000 = -0.60206 * 543000 = -326918.575291. Raising 10 to this power in order to reconstruct the probability, we get 10^(-326918.575291) = 10^(1 – 0.575291 – 326919) = 10^0.424709 * 10^(-326919) = 2.659 * 10^(-326919).
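The same logarithm trick from the Hamlet calculation works here, on a four-key typewriter:

    import math

    BASE_PAIRS = 543_000                       # base pairs in the JCVI-syn3A genome
    log10_p = -BASE_PAIRS * math.log10(4)      # log10 of 1/4^543000
    exponent = math.floor(log10_p)
    mantissa = 10 ** (log10_p - exponent)
    print(f"{mantissa:.3f} * 10^{exponent}")   # 2.659 * 10^-326919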
This is actually a lot easier to type than a single copy of Hamlet. How much easier? 2.659 * 10^(-326919) / (2.73708 * 10^(-349763)) = 9.71 * 10^22843 times, that is, gazillions and gazillions of times easier. Which isn’t saying much, actually, because if we put our team of highly caffeinated primates on this task, they still take quite a while. To be exact, 1/2.659 * 10^(326919 – 80 – 6) = 3.7608 * 10^326832 seconds, or roughly 8.655 * 10^326814 times the age of the universe.
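And again in logarithms, to keep the calculator from overflowing (same 10^80 monkeys typing 10^6 genomes per second each):

    import math

    log10_hamlet = -177263 * math.log10(94)   # log10 probability of typing Hamlet
    log10_syn3a = -543_000 * math.log10(4)    # log10 probability of typing syn3A

    print(log10_syn3a - log10_hamlet)         # ~22843.99: about 10^22844 times easier

    log10_seconds = -log10_syn3a - 80 - 6     # expected seconds for the monkey team
    log10_ages = log10_seconds - math.log10(365.25 * 24 * 3600) - math.log10(13.77e9)
    print(log10_seconds, log10_ages)          # ~326832.58 and ~326814.94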
These chances are not unlike those of getting a car to reassemble from its loose parts by throwing them in a bin and shaking. Except that car parts might snap together one by one when they align correctly, whereas there’s no good reason why segments of a partly correct genome might remain in their happenstance positions in the absence of natural selection rewarding them.
Smaller genomes than that of JCVI-syn3A occur in nature. The smallest known bacterial genome is 112 kbp, and even smaller genomes have been found for intra-cellular organelles and non-cellular organisms: 16.6 kbp for mitochondria, 1.8 kbp for porcine circovirus and around 200 bases for viroids and virusoids.
Among synthetic genomes, Spiegelman’s Monster has 218 bases and Evolutionary Product 1 (EP1) just 48: https://link.springer.com/article/10.1023/A:1006501326129
As you know, the time it takes for your highly caffeinated primates to type out a correct genome varies enormously according to genome size: even a genome as small as 200 bases would take them many times the age of the universe, whereas they’d type out a correct genome of 150 bases in less than 6 hours.
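To see where the crossover lies under the same assumptions as before (10^80 monkeys making 10^6 attempts per second each), here is a quick check; the genome lengths are just illustrative:

    import math

    LOG10_ATTEMPTS_PER_SEC = 80 + 6    # 10^80 monkeys, 10^6 attempts each per second

    def log10_seconds(bases):
        """log10 of the expected time in seconds to type a genome by pure chance."""
        return bases * math.log10(4) - LOG10_ATTEMPTS_PER_SEC

    for n in (543_000, 200, 150):
        print(f"{n} bases: 10^{log10_seconds(n):.1f} seconds")
    # 200 bases -> ~10^34.4 s, vastly more than the ~10^17.6 s age of the universe
    # 150 bases -> ~10^4.3 s, roughly 5.7 hours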
Of course, none of these organisms with a couple of hundred bases or fewer can self-replicate, but if you add an appropriate enzyme to replicate them, then natural selection kicks in and you no longer have to rely on random chance.
Finally, your analysis assumes only one correct genome for a given size, which clearly isn’t accurate. In terms of the monkey analogy, there’s no need for them to reproduce Hamlet – any comprehensible story, however simple, will do. It could even have a fair few typos.
(Follow-up reply sent via email, added by prgomez):
The bacterium with the smallest known genome is Nasuia deltocephalinicola: https://en.wikipedia.org/wiki/Nasuia_deltocephalinicola
However, it’s an obligate endosymbiont of leafhoppers, so it has the same problem as viruses, viroids and virusoids: it can’t survive and reproduce on its own. As far as I can tell, all the smallest bacterial genomes seem to belong to symbionts.
Now that I think of it, perhaps the most likely candidate for the first organism isn’t a tiny genome but rather a somewhat larger genome with a very small number of essential bases or base pairs. For example, if we assume that the 543 kbp of JCVI-syn3A is roughly the minimum number of essential base pairs for a free-living bacterium, wouldn’t it be more likely that these base pairs first formed within a genome of, say, the 4.6 Mbp of E. coli? In that case, to use the monkey analogy again, 12 % of a book would contain a comprehensible story, while the remaining 88 % would be meaningless gobbledegook. The story would be split into many short snippets interspersed randomly throughout the sea of gobbledegook. Surely that would give the monkeys a better chance.
The other thing that increases the probability is the fact that in your 1/4^543000, the numerator shouldn’t be 1 but rather the number of combinations that produce a viable organism, which might be quite high. The biggest question is whether the first “viable organism” was something like a bacterium that could self-replicate or something much simpler like a viroid, a virusoid, Spiegelman’s Monster or EP1 that arose together with a suitable enzyme for replicating it.
Thanks for the additional references. As you well note, Nasuia and the like are not able to live on their own and, consequently, not good first-cell candidates because of the lack of a host in historical first-cell conditions. Certainly, there was no E. coli or anything similar back then, so your proposed mechanism puzzles me. I like the numerator idea better, and we don’t need an exact number; just an order of magnitude would suffice. Let me take a shot at it right here.
Suppose the genome of a minimum-genome being can be as varied as that of a fully evolved organism from the present, and further suppose that every individual living today has a different genome (a huge overestimation, but let’s follow it). According to this source, we have 5 * 10^30 bacteria living today. Say that archaea and eukaryotes are just as abundant; that gives us 1.5 * 10^31 variations that would go in the numerator of the last probability formula, replacing the 1. This factor ends up dividing the previously calculated time required by our simian team. The time required now to hit upon the genome of one living cell capable of further evolution becomes 2.5072 * 10^326801 seconds, or 5.77 * 10^326783 times the age of the universe.
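In logarithms again, with the 1.5 * 10^31 figure as the assumed count of viable genomes:

    import math

    log10_p_one = -543_000 * math.log10(4)   # probability of one specific genome
    log10_viable = math.log10(1.5e31)        # assumed number of viable genomes
    log10_seconds = -(log10_p_one + log10_viable) - 80 - 6
    log10_ages = log10_seconds - math.log10(365.25 * 24 * 3600) - math.log10(13.77e9)
    print(log10_seconds, log10_ages)         # ~326801.40 and ~326783.76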
This still seems to me heavy odds against the spontaneous generation hypothesis. Other arguments I can think of in order to salvage it:
(Some of the previous comment appears indented below)
It seems I didn’t explain this part very clearly. I assumed an arbitrary genome size of 4.6 Mbp for a hypothetical first cellular/self-replicating organism. I took this value from the genome size of E. coli just as an example. I’m not saying there was any actual E. coli back then.
The 543 kbp of JCVI-syn3A seems to be roughly the minimum number of essential base pairs for a cellular/self-replicating organism. As you show, the probability of 543,000 base pairs spontaneously arranging themselves correctly is as close to zero as makes no odds. However, the probability of 4.6 million base pairs arranging themselves such that 543,000 of them are correctly arranged, while the remaining > 4 million are just randomly arranged, must be a fair bit higher.
The synthetic organisms Spiegelman’s Monster and EP1 don’t need a host, although they do need an enzyme to replicate them (RNA replicase and a combination of HIV-1 reverse transcriptase and T7 RNA polymerase, respectively).
I disagree. The “champion” might have out-competed its neighbouring RNA molecules by being easier for the “helper molecules” to replicate quickly and accurately, or by acquiring the ability to produce its own, highly specific “helper molecules” that replicated the “champion” without readily replicating neighbouring RNA molecules. A membrane or other sort of container may have evolved much later, after the original “champion” had long since out-competed its competitors, thereby begetting a new “champion”.
Most non-cellular and unicellular life does behave rather like cancer cells. Here are some quotes from the abstract of https://cancerdiscovery.aacrjournals.org/content/11/8/1886
“The breakdown of multicellularity rules is accompanied by activation of “selfish,” unicellular-like life features, which are linked to the increased adaptability to environmental changes displayed by cancer cells. Mechanisms of stress response, resembling those observed in unicellular organisms, are actively exploited by mammalian cancer cells to boost genetic diversity and increase chances of survival under unfavorable conditions, such as lack of oxygen/nutrients or exposure to drugs.”
“The hallmarks of cancer (the common traits of human tumors) include many features reminiscent of a “selfish,” unicellular-like life.”
“Cancer cells display properties that parallel the behavior of unicellular organisms. The biological, genetic, and metabolic features shared between cancer and bacterial cell populations include competition between clones, glycolytic metabolism, formation of communities by manipulating the environment, stress responses leading to increased genetic instability, and adaptation to hostile conditions such as drug treatments.”
“By dysregulating cellular processes associated with the transition from unicellular to multicellular life, cancer cells activate survival strategies, including rapid proliferation and adaptability to stressful environments, that had been perfected by autonomous organisms such as bacteria. In other words, cancer, a disease of multicellular organisms, shares biological features with prokaryotic cells. Keeping in mind the remarkable evolutionary distance between bacteria and mammalian cells, here we discuss the scientific and therapeutic implications of considering cancer cells as selfish forms of life which subvert multicellularity laws and behave as single competing units much like bacteria.”
I’m also quoting some of your previous comment italicized and indented.
No, the probability is actually smaller. 1/4^543000 is what you get if you have an infinite supply of nucleotides and you only need to worry about arranging 543000 of them in the correct order. If you start with fewer nucleotides and their population mix is not exactly what you need, at some point you will end up having some trouble getting the scarcer nucleotides, which decreases the probability. The effect can be very large if the surplus is small, but starting from 4.6 million base pairs the effect won’t be large. Not worth calculating exactly.
I read up a little on Spiegelman’s Monster (https://en.wikipedia.org/wiki/Spiegelman%27s_Monster), but couldn’t find anything on EP1. I wouldn’t call Spiegelman’s Monster a “synthetic organism” because it’s really the combination of a piece of RNA and a very special polymerase that is not encoded in the RNA. So, even if the polymerase could help the RNA to replicate, how do you replicate the polymerase? Or maybe you need just one polymerase molecule that somehow just managed to bump into the RNA?
The whole idea of requiring a membrane is an important piece of most abiogenesis mechanisms that have been proposed. I didn’t invent it and, frankly, I don’t quite understand it. The Wikipedia article on Abiogenesis has a summary of it: https://en.wikipedia.org/wiki/Abiogenesis
Likewise, I don’t know enough about cancer to debate on whether or not an RNA molecule behaves exactly like cancer. Consider my comment a simple simile.
I’m getting tired of this back and forth, so I guess I’ll end this thread right here. Thanks!
I’ll also be happy to end this thread after a third and final attempt to explain one point that I’ve obviously failed to explain clearly on my first two attempts.
1/4^543000 is what you get if you have an infinite supply of 4 different types of nucleotide bases or base pairs, you choose a sequence of 543,000 of them at random, and all 543,000 are both correct and in the correct order.
I’m suggesting that a more appropriate model is if you have an infinite supply of 4 different types of nucleotide bases or base pairs, you choose a sequence of 4.6 million of them at random, and within these 4.6 million, there exists a subsequence of 543,000 nucleotide bases or base pairs that are both correct and in the correct order. In general, not all the nucleotide bases or base pairs in the subsequence will have been consecutive in the original sequence. Since there are a great many possible subsequences of length 543,000 from an original sequence of length 4.6 million, this increases the probability by orders of magnitude.
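To put a rough number on “a great many possible subsequences”: the count of ways to pick 543,000 positions out of 4.6 million is the binomial coefficient C(4600000, 543000), whose size can be gauged with the log-gamma function. This only counts candidate position sets; it is not a probability, and it says nothing about which arrangements would be biologically viable. It merely illustrates the size of the combinatorial factor involved.

    import math

    def log10_binomial(n, k):
        """log10 of n choose k, computed via the log-gamma function."""
        return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(10)

    print(log10_binomial(4_600_000, 543_000))   # roughly 725,000, i.e. about 10^725000 position sets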