For Darwin Day: On the Origin of Genetic Information

by Eric Drexler on 2009/02/12


[Image: The Darwin Day blogging badge, framed with CRD and 200 in binary notation.]

Darwin by the numbers

The ideas that evolved from Darwin’s thought have shaped my thinking for more than 35 years, and a decade into that span, writing Engines of Creation, I relied on the generality of evolutionary principles as an anchor point for surveying the future of technology. Today, in my home, “Uncle Charles says…” means “Evolutionary principles say…”

In joining the festive blogging on this Darwin Day, the 200th anniversary of his birth, I’d like to describe a way of thinking about the origin of genetic information, a perspective that quantifies the stream of information that shaped our genomes over the billion years (and more) before us. This may, in a small way, help to counter some of the disinformation (and honest confusion) around Darwin’s ideas.

Information from the genetic guessing game:
About one bit per life

The gene pool of a species contains information about both organisms and their environment; for an animal species, information about protein stability, catalysis, and hemodynamics; about how to feed, evade, and reproduce. Each individual organism born into the world is like a question that genes ask of the environment, a question about fitness, and the answers can be thought of as numbers representing reproductive success. And like the answers in the game of 20 Questions, each answer conveys information that affects future guesses.

How much information does this convey? For a sexual, diploid organism, the environment responds with answers of 0.0, 0.5, 1.0, 1.5, and so on (each offspring counts as half a success for each of its two parents); over time, in a finite world, the mean value of these numbers must be close to one. The recipient of this data is not an organism, nor a genome, but the gene pool of a species, and the information conveyed could be encoded as a string of numbers, one number per organism-life. The Shannon entropy — the information content — of a string like this looks to me to be not much more than 1 bit per number (and there’s a redundancy associated with two parents sharing a success). Roughly speaking, one life adds one bit to the stream of raw data from which statistical patterns of adaptive quality emerge.
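
As a rough numerical check (the distribution below is my own illustrative assumption, not something implied by the argument), one can compute the Shannon entropy of such an answer stream directly. Assuming each parent’s reproductive share is Poisson-distributed with a mean of one, the entropy works out to a bit or two per answer, and a crude discount for the two-parent redundancy brings it to about one bit per life:

```python
import math

# Illustrative assumption (not from the post): each parent's reproductive
# share is Poisson-distributed with mean 1, since in a finite world the
# mean share must stay close to one.
def shannon_entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

mean = 1.0
poisson = [math.exp(-mean) * mean**k / math.factorial(k) for k in range(25)]
h = shannon_entropy_bits(poisson)
print(f"Entropy per answer: {h:.2f} bits")                     # about 1.9 bits
print(f"Crudely halved for two-parent redundancy: {h/2:.2f}")  # about 1 bit
```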

Too much information!

With all this in mind, is the information content of a mammalian genome large, or hard to explain? Consider the numbers.

Since the beginning of biological time, there has been a gene pool in existence that evolved to become our own. How many organism-lives, how many bits of information, has the environment input to that evolving gene pool? Pick your own numbers if you please, but here is a set that gives the general idea: Across a billion years and a billion generations of sometimes fishy existence, a population of a million organisms would experience 10^15 organism-lives, yielding 10^15 bits.

The information content of a mammalian gene pool is roughly the same as that of a genome. Taking the human genome as a baseline (and rounding down to correct for redundancy), the information content is about 10^9 bits.

In short, the raw information flowing from the environment to our gene pool has been vastly more than the information it has retained, larger by a factor of perhaps a million. Even allowing for great inefficiency, this information input seems, and was, enough.
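
For concreteness, here is the same back-of-the-envelope arithmetic spelled out (the particular numbers are, again, only illustrative):

```python
# Back-of-the-envelope arithmetic with the illustrative numbers above.
generations   = 10**9   # about a billion generations
population    = 10**6   # about a million organisms per generation
bits_per_life = 1       # roughly one bit of raw data per organism-life

raw_input_bits = generations * population * bits_per_life  # ~10^15 bits
genome_bits    = 10**9   # rough information content of a mammalian genome

print(f"Raw input to the gene pool: {raw_input_bits:.0e} bits")
print(f"Retained in the genome:     {genome_bits:.0e} bits")
print(f"Input / retained:           {raw_input_bits / genome_bits:.0e}")
```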


In an essay on “The Information Challenge”, Richard Dawkins discusses the information content of a genome, and remarks that

Mutation is not an increase in true information content, rather the reverse, for mutation, in the Shannon analogy, contributes to increasing the prior uncertainty. But now we come to natural selection, which reduces the “prior uncertainty” and therefore, in Shannon’s sense, contributes information to the gene pool.

I’ve outlined one way to estimate the quantity of this information, in its raw, input form; if a reader will direct me to a source to cite, I’d be grateful.


Update:

I thank Darius Bacon for providing a link to a work that includes an analysis of this and several related topics: Chapter 19 of Information Theory, Inference, and Learning Algorithms, by David MacKay of (yes!) Darwin College, Cambridge. See the comments for an extended discussion.


See also:

  • On the evolution of technology: The Technology Tree
  • On information in fabrication: From Self-Assembly to Mechanosynthesis

Comments:

    ale February 12, 2009 at 12:28 pm UTC

    A while ago, E Yudkowsky pointed me to
    http://dspace.dial.pipex.com/jcollie/sle/ which claims information gain speed limits of 2 log (number of offspring per parent) regardless of population size, on arguments which I vaguely understand.
    Also, from the mutation rates, the maximum information content can be bounded (10^-8 error rate => ~10^8 bits maximum information content?)

    ale February 12, 2009 at 12:31 pm UTC

    That information gain speed bound should have read:
    2 log(number of offspring per parent) per generation

    Eric Drexler February 12, 2009 at 7:41 pm UTC

    There can’t be a fundamental, size-independent limit on the rate of information gain per population per generation, because real-world “populations” aren’t distinct and fundamental entities (credit to Mark S. Miller here for the general heuristic that highlights this point). To illustrate, imagine that I have a population of a billion organisms, and someone claims that there is a fixed limit on the rate of information gain (which can be read as a limit on the rate of accumulation of useful genetic innovations). If I divide the population into a million separate populations, the supposed limit now would allow useful genetic information to increase a million times faster. After a few generations, I merge the populations, and could have achieved advances that would supposedly have required ten million generations. A principle with this consequence obviously makes no sense.

    The idea that a fundamental principle places a limit on the amount of genetic information that can be sustained by selection in the face of a given mutation rate is more plausible, but it doesn’t work either, at least without further qualification. A fundamental principle must apply within any reasonable thought experiment, so consider this one: A species, like some marine organisms, spawns to produce many larvae, but larvae (here I choose an illustrative but not biologically realistic parameter) are so sensitive to mutations that all but one in a million mutations cause a larva to die before it reaches the reproductive stage. Roughly speaking, this would allow a million-fold greater mutation rate than the principle seems to imply. (With further qualification of the effect of the mutations, however, something like this principle makes more sense.)

    An interesting fact about the potential reliability of molecular machinery: Mutation rates in replication of human genetic information can be as low as 10^-11. The key is error-checking achieved by expending energy, and there is no fundamental limit to the potential reliability. This, of course, has more general implications for molecular assembly.

    Darius Bacon February 12, 2009 at 11:16 pm UTC

    MacKay in Information Theory, Inference, and Learning Algorithms, Chapter 19, analyzes a model of this question. I haven’t checked it.

    Eric Drexler February 13, 2009 at 4:45 am UTC

    Thank you, Darius. You’ve offered a very appropriate reference for the occasion of Darwin Day, since the topic is evolution, and the author, David MacKay, is at Darwin College, Cambridge.

    The book is directly relevant to the topic at hand, and even a brief examination indicates that it is a high-quality and thoroughly mathematical work. Several of MacKay’s conclusions bear directly on the discussions above and in my post:

    Regarding information inputs from the environment to a gene pool via selection, MacKay derives similar estimates by applying the same fundamental principle. It’s good to have a citation for this.

    MacKay also derives a bound on the maximum tolerable mutation rate for a sexually reproducing species; this is larger than one might expect, because it scales as the square root of the genome size. He notes that the maximum tolerable mutation rate is, as I argued above, partly determined by the number of offspring, and he quantifies this.

    Regarding the rate at which a gene pool can incorporate useful information from the stream of raw information that flows from selection, MacKay derives a bound (again, for a sexually reproducing species) that is the square root of the genome size (in bits) per generation per population. For a genome of 10^8 bits, this allows an information gain, not of just 1, but of 10^4 bits per generation per population.

    This leaves the problematic “per population” aspect of the theorem. I assume that the properties of the mathematical idealization that MacKay terms a “population” capture an important aspect of reality, and that they support his theorem. Turning to the messier real world, however, the population splitting and re-merging thought experiment that I described above suggests caution in applying the theorem to non-idealized populations, or in drawing broad conclusions about the capabilities of artificial evolutionary systems.

    (A more technical note: Non-idealized populations, including those subject to splitting and merging, can have strong linkage disequilibrium, that is, non-random combinations of genes. When considering a relatively small number of generations, the units of selection are on the scale of chromosome segments, which contain many genes as usually defined. A particularly clear (and extreme) case of linkage disequilibrium arises in nature when several point mutations have modified the structure of a single protein: This set of mutations will propagate as a single unit of selection. Effects of this sort must be taken into account when considering the capacity of selection pressure and trying to reason at the level of bits.)

    ale February 13, 2009 at 1:28 pm UTC

    In the link I posted, the author does mention the division into subpopulations to beat the limit, but he argues the situations in which they would beat the limit would rarely arise in nature since each sub-population would have to be subject to independent selection pressures and when recombined, all selection pressures would have to remain present. In the best of cases, assuming the argument in the link is correct, the limit would still be O(k), not O(size of genome).

    The trick to beat the mutation rate is clever, but again, it is probably not realistic. It is relevant for finding bounds on what evolution could gain in principle, but not for the information content of natural life, or of us.

    Chris Phoenix February 14, 2009 at 11:38 am UTC

    If (which I’m not sure of) new species usually arise from small isolated sub-populations, then any animal that’s not an ancestor of a sub-population which produces a new species will have its accumulated genetic “learning” disappear when its species goes extinct. For example, not only were all Neanderthals post-dating the split-off of humans wasted experimentation, but all Neanderthals that lived before humans yet were not ancestors of humans were also wasted experimentation.

    This effect may not matter much, because if we look back over a lot of generations, the bottleneck organisms will have lots of ancestors. How many depends on factors including the mobility of the life form in question. A reasonably mobile species may have fairly complete mixing across its whole range within, say, 100 generations, which is a small fraction of the million-year span that I’ve heard quoted as the average species span.

    So this is probably only a second-order correction. But it might be worth mentioning.

    Chris

    Darius Bacon February 15, 2009 at 10:47 pm UTC

    I didn’t know he was at Darwin College — what a neat coincidence.

    Another factor left out that’s perhaps worth mentioning: genetic polymorphism (like the heterozygote advantage behind sickle-cell anemia).

    Bruce Smith February 17, 2009 at 7:51 am UTC

    You wrote:

    If I divide the population into a million separate populations, the supposed limit now would allow useful genetic information to increase a million times faster. After a few generations, I merge the populations, and could have achieved advances that would supposedly have required ten million generations. A principle with this consequence obviously makes no sense.

    Not so fast — when you mentally re-merge the populations, i.e. admit that they were really one population all along, with a single gene pool, the whole-population gene pool is forced to average out what the subpopulation gene pools “learned”, which reduces the amount of information it can retain from the “raw input” of offspring-counts per individual.

    It’s the mixing of genomes within a natural population which justifies thinking of it as a unit (a single “gene pool” from which new genomes are randomly drawn), and which also necessarily reduces its retained information from the raw input (as well as increasing its accuracy as a measurement of the environment, to the extent the environment is uniform). If this mixing breaks down, more information might be retained; but also, the organisms might speciate (perhaps not a coincidence?).

    My conclusion from this: your thought experiment isn’t a good reason to doubt the plausibility of a population-size-independent upper limit on information retained in a population, if by a “population” we mean something justifiably thought of as having “a single gene pool”.

    (I’m not saying I’ve given evidence for such a limit — ale’s reply sounds like it points to something more sophisticated, which does that — only for your thought experiment not ruling one out.)

    Of course there can be ambiguity in what counts as “one population”, just like “one mountain” (e.g. it might have subpopulations with limited but nonzero gene exchange between them; one mountain with two peaks might be thought of as two nearby mountains), but that doesn’t make the concept of a population arbitrary or unnatural — it’s based on what genome mixing is actually occurring.

    Bruce Smith February 17, 2009 at 7:53 am UTC

    Clarification: by “information retained in a population” above, I meant “new information (derived from selection) retained in a population, per generation”.

    Eliezer Yudkowsky February 18, 2009 at 10:24 am UTC

    MacKay is talking about gaining bits like bits on a hard drive – flipping zeroes in a string to ones; this is not the same as gaining bits in an information-theoretic sense.

    Although I originally thought that an error rate of 10^-8 would imply a maximum genome size of around 10^8, a simple Python simulation failed to validate this. Although it takes one death to remove one mutation from the gene pool, more than one mutation can be removed by the same death. And this indeed gets you an info bound that goes as the square root of genome size, not just selection pressure.

    The simulation was written for perfect selection on an entire gene pool (the exact bottom half being eliminated, and such) and it’s not clear to me what happens when you start introducing more realistic assumptions like stochastic selection on small bands or families.
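
    For anyone who wants to experiment, here is a minimal sketch of this kind of model (per-bit mutation, free recombination, and “perfect” truncation selection that eliminates the exact bottom half). It is a reconstruction under illustrative assumptions, not the original simulation:

```python
import random

# A minimal reconstruction (illustrative assumptions, not the original code):
# per-bit mutation, free recombination, and "perfect" truncation selection
# that eliminates the exact bottom half of the gene pool each generation.
GENOME_BITS   = 1000    # genome size; each bit is 0 (functional) or 1 (deleterious)
POP_SIZE      = 200     # number of individuals (kept even)
MUTATION_RATE = 0.001   # per-bit, per-generation mutation probability
GENERATIONS   = 200

def mutate(genome):
    return [b ^ 1 if random.random() < MUTATION_RATE else b for b in genome]

def child(a, b):
    # Free recombination: each bit is drawn independently from one parent.
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

population = [[0] * GENOME_BITS for _ in range(POP_SIZE)]
for gen in range(GENERATIONS):
    population = [mutate(g) for g in population]
    # Truncation selection: rank by mutation count and keep the fitter half.
    population.sort(key=sum)
    survivors = population[: POP_SIZE // 2]
    # Random mating among survivors restores the population size.
    population = [child(random.choice(survivors), random.choice(survivors))
                  for _ in range(POP_SIZE)]
    if gen % 25 == 0:
        mean_load = sum(map(sum, population)) / POP_SIZE
        print(f"generation {gen:4d}: mean mutations per genome = {mean_load:.1f}")
```

    Varying GENOME_BITS and MUTATION_RATE and watching whether the mean mutation load settles is one way to see the “more than one mutation removed per death” effect at work.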

    It’s at least worth noting that in real life the genome seems to be mostly junk, and even if it’s not, it still contains vastly less info than it “could” according to MacKay’s bound and the age of life on Earth.

    Merging subpopulations is not a trustworthy argument against various attempted speed limits, because the combined population doesn’t instantly get the best of every subpopulation. If you split up into a million subpopulations, then, after merging, every beneficial adaptation promoted in them goes to a frequency of 1 in a million. That’s going to take some time to rise to universality, and any collisions will lose information, and any complex adaptations will be fragmented.

    See this discussion on Overcoming Bias for much more.

    Eric Drexler February 20, 2009 at 11:08 pm UTC

    @ Bruce Smith –
    You make good criticisms of my slapdash argument, which needs some modification.

    First, I should say that I assume that the analysis made by David MacKay is correct, given its assumptions. Any conflict is only with hasty conclusions that might be drawn from it.

    For background: I’m not concerned with biological realism here, but with constraints that might incorrectly be thought to apply to all evolutionary systems, including computational systems designed to evolve their information quickly.

    For simply dividing and re-merging a population, let’s substitute ongoing partial division: We have one big, gene-mixing population when viewed on a long time scale, but not with the fastest and most uniform possible mixing.

    Now consider “epistasis”, that is, non-additive gene effects. It is both reasonable and permissible for present purposes to postulate that many mutations produce a small benefit that becomes large only when combined with a second mutation which (for example) fixes a problem created by the first, and would be disadvantageous in itself. The second mutation can’t spread until the first becomes common, and in a small population, this will happen in fewer generations. The story from there involves mixing from one small population to another, or perhaps favorable linkage disequilibrium when mixing into a larger population. This example is enough to show that conclusions drawn from a mathematically simple, idealized population model need not hold in general. (Note that GAs are often coded with structured populations of the sort I described.)

    There is also the question of what the question is. If (as tends to happen) “rate of information gain” is tacitly regarded as meaning “rate of evolution”, there is another problem, because conclusions about the former don’t necessarily tell us much about the latter. (“Rate of evolution” is clearly a meaningful concept, even if it lacks a unique metric).

    Here, the argument is simple: It is permissible for present purposes to postulate a mapping from genotype to phenotype such that very rare mutations are very important; if so, then for a given mutation rate, a large population will find them faster, simply because there are more mutations in total. In this model, for a given rate of information gain, a larger population would achieve a faster increase in information value.

    Here again, I am not saying that this is (or isn’t) a good description of biology, merely that it’s unwise to leap quickly from a theorem about information gain to a conclusion about the rate of significant evolutionary change.

    Will Ware February 24, 2009 at 3:30 pm UTC

    Thanks all for the reference to the MacKay book. It looks very interesting and I am amazoning myself a copy this week.
