I’ve got my hands on a strange paper by D Kanduc: “Protein information content resides in rare peptide segments”. Here’s the abstract.
Discovering the informational rule(s) underlying structure-function relationships in the protein language is at the core of biology. Current theories have proven inadequate to explain the origins of biological information such as that found in nucleotide and amino acid sequences; an ‘intelligent design’ is now a popular way to explain the information produced in biological systems. Here, we demonstrate that the information content of an amino acid motif correlates with the motif rarity. A structured analysis of the scientific literature supports the theory that rare pentapeptide words have higher significance than more common pentapeptides in biological cell ‘talk’. This study expands on our previous research showing that the immunological information contained in an amino acid sequence is inversely related to the sequence frequency in the host proteome.
What? This is an intelligent design paper? How interesting. Unfortunately, the abstract is wrong: ‘intelligent design’ is not a popular way to explain information in biological systems. I also read through the whole thing and missed the part where it actually supports ID.
Here’s what the paper actually does: it dissects a sample protein and asks about the frequency of its components in the proteome. It looks specifically at calmodulin (CaM), an important and highly conserved protein that is involved in all kinds of developmental and physiological interactions. The rather arbitrary unit the protein is broken down into is 5 amino acid chunks, or pentapeptides, and each pentapeptide sequence is searched for in genes other than CaM. If this is the initial sequence of CaM,
MADQLTEE…
Then what Kanduc does is search the proteome for MADQL, ADQLT, DQLTE, etc., and count the number of times each appears. Rare pentapeptides are equated with high information content, and common ones are assigned low information content. Some pentapeptides, in his analysis, are found only in CaM, while others are found multiple times, with an average of 12 occurrences. This is supposed to be significant.
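To make that procedure concrete, here is a minimal Python sketch of the sliding-window search as I understand it. The function names and the tiny "proteome" are made up for illustration; they are not from the paper, and a real analysis would run against the full set of human protein sequences with CaM itself excluded.

```python
from collections import Counter

def pentapeptides(seq, k=5):
    """Return every overlapping k-residue window of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_in_proteome(query, proteome, k=5):
    """For each k-mer of `query`, count its occurrences across `proteome`.

    `proteome` is a list of protein sequences that should NOT include
    the query protein itself.
    """
    counts = Counter()
    for protein in proteome:
        counts.update(pentapeptides(protein, k))
    # Counter returns 0 for windows that never occur elsewhere.
    return {word: counts[word] for word in pentapeptides(query, k)}

# Hypothetical toy data: the first eight residues quoted above, and
# three invented sequences standing in for "every other protein".
cam_start = "MADQLTEE"
toy_proteome = ["GGMADQLGG", "AAADQLTKK", "MADQLPP"]

hits = count_in_proteome(cam_start, toy_proteome)
for word, n in hits.items():
    print(word, n)
# MADQL 2
# ADQLT 1
# DQLTE 0
# QLTEE 0
```

In this toy run, DQLTE and QLTEE would be the "rare" windows and MADQL the "common" one.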
It’s also where he loses me. If you search a completely random string of amino acids for an arbitrary pentapeptide, it should turn up, on average, once in every 3,200,000 amino acids. If you search a long enough chunk of amino acid sequence, one that’s long enough to generate on average 12 hits, what you’d expect to see is a bell-shaped distribution — some pentapeptides may appear only once, while others appear dozens of times, just by chance. And that is what Kanduc sees. That some pentapeptides are unique to CaM is perhaps not too surprising, especially when you consider that the proteome is not a random sequence at all, but the product of frequent gene duplications and is also refined by selection.
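The arithmetic behind that chance expectation can be sketched in a few lines of Python. This assumes the simplest possible null model: 20 amino acids, each equally likely and independent of its neighbours, so that the hit count for any one pentapeptide is approximately Poisson-distributed. Real proteomes have skewed amino acid frequencies and duplicated genes, so treat this as a baseline and not a description of the actual data; the variable names are mine.

```python
import math

ALPHABET_SIZE = 20   # standard amino acids, assumed equally frequent
K = 5                # pentapeptides

# Number of distinct pentapeptides, i.e. one expected hit per this many positions.
possible_words = ALPHABET_SIZE ** K          # 3,200,000

# Sequence length needed for an average of 12 hits per pentapeptide.
mean_hits = 12
residues_needed = mean_hits * possible_words  # 38,400,000

def poisson_pmf(n, mean):
    """Probability of exactly n hits when `mean` hits are expected."""
    return math.exp(-mean) * mean ** n / math.factorial(n)

# Under this idealized model, how often would a pentapeptide land
# far from the average purely by chance?
p_at_most_3 = sum(poisson_pmf(n, mean_hits) for n in range(4))
p_at_least_24 = 1 - sum(poisson_pmf(n, mean_hits) for n in range(24))

print(f"possible pentapeptides: {possible_words:,}")
print(f"residues for a mean of {mean_hits}: {residues_needed:,}")
print(f"P(3 or fewer hits)  = {p_at_most_3:.4g}")
print(f"P(24 or more hits)  = {p_at_least_24:.4g}")
```

Comparing the observed counts against a baseline like this one, or better, against one built from the real amino acid frequencies, is the kind of null-hypothesis check the paper would need.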
So far, this idea that some pentapeptides will be rare and others common is utterly uninteresting and unsurprising. I would have liked to have seen some consideration of the null hypothesis, that the distribution is due to chance alone, but that seems to be totally lacking. If I’d been reviewing the paper, I would have sent it back with a request for revisions to consider that possibility.
However, Kanduc does propose something that actually is interesting: that the rare pentapeptide sequences in specific genes also correlate with regions that have important functional roles.
Using the CaM features, attributes and annotations reported at www.uniprot.org/uniprot/P62158, we find that modification sites, structural beta strand motifs, functional domains, and epitopic determinants are confined primarily to areas of low similarity with the human proteome.
Now that’s kind of cool, if true. It’s also a bit unsurprising. He does walk along the length of the CaM protein and show that rare pentapeptide regions coincide with sites for acetylation, ubiquitylation, and phosphorylation, and with the calcium binding site, for instance; but these are functional regions of the protein where one would expect some selection for specific properties. We also get a different analysis, in which naturally occurring pentapeptide fragments that are known to have significant biological activity are searched for in the human proteome, and found to be fairly rare. Again, this might be an expected result explained by selection (after all, a sequence that can trigger apoptosis might be expected to be confined by selection to a limited range of sites), and it doesn’t seem to me to require postulating an intelligent designer.
As a paper that hints at some possible functional correlations in the proteome, it’s mildly diverting. It’s weak in that it doesn’t address the null hypothesis very well — I get the impression the author is more interested in fishing for correlations than in actually testing his hypothesis. Where it starts triggering alarm bells, though, is the shoutout to creationists. Kanduc says this about CaM:
…the CaM sequence is characterised by both specificity and complexity (what information theorists call ‘specified complexity’); in other words, it has ‘information content’.
Uh-oh. “Specified complexity” is a meaningless phrase; the creationists have not defined how to measure “specification”. In this case, Kanduc hasn’t either, and his criterion for calling it “specified complexity” is that CaM has various functional domains, which is kind of expected for a protein that has functions. I find it interesting, too, that he doesn’t provide a citation for his claim — Dembski doesn’t get an acknowledgment. Probably because it would be a too-obvious hint about where in looney-land this idea is coming from, and because Dembski doesn’t bother to explain how to calculate “specified complexity” either.
Also, there’s something suspicious about the phrasing there — it seems to be straight out of Meyer 2000:
Systems that are characterized by both specificity and complexity (what information theorists call “specified complexity”) have “information content”.
Hmmmmm. (Thanks to Blake Stacey for picking up on that identity.)
Another problem with the paper is the conclusion, which is some unholy amalgam of a dog’s breakfast and a word salad, and either way is grossly unappetizing.
Researchers in the fields of biology and immunology need to define objective informational entities and reductionist basic laws that are valid everywhere and for everything. As new objects and scientific laws are absorbed into experimental protocols and reports, abstract terms such as “sense”, “edit”, and “attack” as well as old dogmas such as the self/non-self dichotomy will become obsolete in favour of more intelligible and concrete theories and biological activities. This process will enable the effective translational application of science to medicine.
What the heck does that mean? What does it have to do with the rest of the paper? Again, if I’d been reviewing it, that would have gone back with a recommendation to delete the gobbledygook and write a conclusion that actually makes sense in the light of the rest of the paper.
What we have here is yet another case of poor reviewing and editing. There is a germ of an interesting observation in the work that the author fails to examine critically and convincingly, but the main intent seems to be to inject the words “intelligent design” into a reviewed scientific paper (while failing to justify why that is a useful hypothesis) and for the author to ride some obscure immunological hobbyhorse which is also not addressed by any of the data. It’s remarkably sloppy work that should have been sent back for extensive revision, rather than being published as is.
I do notice that it was received at Peptides on 20 January, and then bounced back and accepted just two weeks later, after what must have been only minor revisions. The journal is commendably fast in its turnaround, but this looks like a case where haste just churned up the garbage a bit more.
Kanduc D. Protein information content resides in rare peptide segments, Peptides (2010), doi:10.1016/j.peptides.2010.02.003





