The ideas that evolved from Darwin’s thought have shaped my thinking for more than 35 years, and a decade into that span, writing Engines of Creation, I relied on the generality of evolutionary principles as an anchor point for surveying the future of technology. Today, in my home, “Uncle Charles says…” means “Evolutionary principles say…”
In joining the festive blogging on this Darwin Day, the 200th anniversary of his birth, I’d like to describe a way of thinking about the origin of genetic information, a perspective that quantifies the stream of information that shaped our genomes over the billion years (and more) before us. This may, in a small way, help to counter some of the disinformation (and honest confusion) around Darwin’s ideas.
Information from the genetic guessing game: about one bit per life
The gene pool of a species contains information about both organisms and their environment; for an animal species, information about protein stability, catalysis, and hemodynamics; about how to feed, evade, and reproduce. Each individual organism born into the world is like a question that genes ask of the environment, a question about fitness, and the answers can be thought of as numbers representing reproductive success. And like the answers in the game of 20 Questions, each answer conveys information that affects future guesses.
How much information does each answer convey? For a sexual, diploid organism, the environment responds with answers of 0.0, 0.5, 1.0, 1.5, and so on; over time, in a finite world, the mean value of these numbers must be close to one. The recipient of this data is not an organism, nor a genome, but the gene pool of a species, and the information conveyed could be encoded as a string of numbers, one number per organism-life. The Shannon entropy — the information content — of a string like this looks to me to be not much more than 1 bit per number (and there’s a redundancy associated with two parents sharing a success). Roughly speaking, one life adds one bit to the stream of raw data from which statistical patterns of adaptive quality emerge.
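To make that estimate concrete, here is a minimal sketch of the entropy calculation, assuming a hypothetical distribution of success values concentrated near one (the probabilities below are my own illustrative numbers, not data):

```python
from math import log2

# Hypothetical distribution of per-parent reproductive-success values,
# concentrated near 1.0; the probabilities are illustrative assumptions,
# chosen only so that the mean comes out to one.
distribution = {0.5: 0.15, 1.0: 0.70, 1.5: 0.15}

mean = sum(value * p for value, p in distribution.items())

# Shannon entropy in bits, H = -sum(p * log2(p)): the average information
# conveyed per organism-life under this distribution.
entropy_bits = -sum(p * log2(p) for p in distribution.values())

print(f"mean reproductive success: {mean:.2f}")          # 1.00
print(f"bits per organism-life:    {entropy_bits:.2f}")  # ~1.2
```

A broader distribution of success values would yield somewhat more than this; the point of the sketch is only that, for plausible distributions with mean one, the figure stays within a small factor of one bit.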
Too much information!
In light of all this, is the information content of a mammalian genome large, or hard to explain? Consider the numbers.
Since the beginning of biological time, there has been a gene pool in existence that evolved to become our own. How many organism-lives, and how many bits of information, has the environment fed into that evolving gene pool? Pick your own numbers if you please, but here is a set that gives the general idea: across a billion years and a billion generations of sometimes-fishy existence, a population of a million organisms would experience 10^15 organism-lives, yielding 10^15 bits.
The information content of a mammalian gene pool is roughly the same as that of a genome. Taking the human genome as a baseline (about 3 × 10^9 base pairs at up to two bits each, rounded down to correct for redundancy), the information content is about 10^9 bits.
In short, the raw information flowing from the environment to our gene pool has been vastly more than the information it has retained, larger by a factor of perhaps a million. Even allowing for great inefficiency, this information input seems, and was, enough.
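For readers who want the arithmetic in runnable form, here is the whole back-of-envelope estimate, using the figures above as assumptions:

```python
# Back-of-envelope estimate; every figure here is an illustrative
# assumption from the text, not a measurement.
generations   = 1e9  # about a billion generations
population    = 1e6  # about a million organisms per generation
bits_per_life = 1.0  # roughly one bit of raw data per organism-life

raw_input_bits = generations * population * bits_per_life  # 1e15 bits

gene_pool_bits = 1e9  # rough information content of a mammalian gene pool

print(f"raw environmental input: {raw_input_bits:.0e} bits")
print(f"retained in gene pool:   {gene_pool_bits:.0e} bits")
print(f"input/retained ratio:    {raw_input_bits / gene_pool_bits:.0e}")  # ~1e6
```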
In an essay on “The Information Challenge”, Richard Dawkins discusses the information content of a genome, and remarks that
Mutation is not an increase in true information content, rather the reverse, for mutation, in the Shannon analogy, contributes to increasing the prior uncertainty. But now we come to natural selection, which reduces the “prior uncertainty” and therefore, in Shannon’s sense, contributes information to the gene pool.
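Dawkins’s point lends itself to a toy illustration: mix an allele distribution toward uniform to model mutation (entropy rises), then reweight it by relative fitness to model selection (in this example, entropy falls). The frequencies and fitness values below are arbitrary assumptions:

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * log2(q) for q in p if q > 0)

# Hypothetical allele frequencies in a gene pool.
alleles = [0.7, 0.2, 0.1]

# Mutation: with probability u an allele is replaced at random, mixing
# the distribution toward uniform and raising the "prior uncertainty".
u = 0.1
n = len(alleles)
mutated = [(1 - u) * q + u / n for q in alleles]

# Selection: reweight by hypothetical relative fitnesses; here the
# distribution re-concentrates and entropy falls.
fitness = [1.2, 1.0, 0.8]
weighted = [q * f for q, f in zip(mutated, fitness)]
selected = [w / sum(weighted) for w in weighted]

print(f"H(before):          {entropy(alleles):.3f} bits")
print(f"H(after mutation):  {entropy(mutated):.3f} bits")
print(f"H(after selection): {entropy(selected):.3f} bits")
```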
I’ve outlined one way to estimate the quantity of this information, in its raw, input form; if a reader will direct me to a source to cite, I’d be grateful.
I thank Darius Bacon for providing a link to a work that includes an analysis of this and several related topics: Chapter 19 of Information Theory, Inference, and Learning Algorithms, by David MacKay of (yes!) Darwin College, Cambridge. See the comments for an extended discussion.