Some responses to “A cautionary tale on reading phylogenetic trees”


Back in September, I complained that a PLoS ONE article purporting to provide “valuable insight into the evolution of eukaryotes” contained substantive problems that should have been caught during the peer review process (“A cautionary tale on reading phylogenetic trees”). The problems are so serious that, in my opinion, they render the bulk of the results invalid.

There were also numerous problems with the interpretation of those results, mainly stemming from misunderstandings about what kinds of information phylogenetic trees represent:

Some of these problems are just rhetorical, but some of them are substantive, and this is the real problem. A failure to understand that phylogenies represent sister group relationships has led to incorrect interpretations of evolutionary relationships, such as that the outgroup is more closely related to one ingroup clade than another, that the sister of one clade is a ‘link’ to another clade, and that a single branching event can have a bunch of different divergence times.

I later admitted, in response to criticism from a reader, that I may have been overly pedantic in pointing out some of the rhetorical problems (“A valid point”). In this post, though, I’m going to focus on the substantive problems and respond to a couple of comments to the original post.

Before I do, I want to say that I don’t bear the authors of this paper any ill will. As I said before,

Just as when I’m reviewing a paper, I try to remember that this is most likely someone’s master’s thesis or a chapter of their PhD dissertation, and the lead author might have a couple of years of their life invested in it. I feel bad heaping criticism on someone’s hard work, but the peer review process exists for a reason. I see this as a failing of the reviewers, the editorial staff, or both.

But the paper is flawed, deeply flawed, and it would be irresponsible to pretend otherwise. PLoS ONE has been “looking into the concerns raised” for over three months now, but they haven’t yet taken any action that I’m privy to. On December 4th, the journal’s Publications Manager was only able to tell me that they appreciate my patience and are continuing to pursue the issue.

The first substantive problem, and the one that put this paper on my radar in the first place, was the choice of outgroup. The study was an effort to infer evolutionary relationships among eukaryotes, including animals, fungi, protists, and plants, and the authors chose the green alga Chlamydomonas reinhardtii as the outgroup. They also present Chlamy’s outgroup status as both an assumption (which it is) and a result (which it isn’t), but that’s a rhetorical problem, not a substantive one.

For a study including plants, animals, and fungi, a green alga is a bizarre choice for an outgroup. Here’s why: an outgroup cannot, by definition, be part of the ingroup. In cladistic terminology, the outgroup is sister to the clade that comprises the ingroup; in other words, all of the species in the ingroup are more closely related to each other than any is to the outgroup. A choice of outgroup therefore reflects a claim about evolutionary relationships. In this case, the claim the authors have made is that plants, animals, and fungi are more closely related to each other than any is to Chlamydomonas.
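To make the logic concrete, here is a minimal Python sketch of that definition. The four-taxon tree is a deliberately simplified stand-in for the widely accepted relationships (green algae sister to land plants, animals sister to fungi), and the function names are mine, invented for illustration; none of this comes from the paper under discussion. A candidate outgroup passes only if it falls outside the smallest clade that contains every ingroup taxon.

```python
def leaves(tree):
    """Return the set of tip names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    tips = set()
    for child in tree:
        tips |= leaves(child)
    return tips


def smallest_clade(tree, taxa):
    """Return the smallest subtree whose tips include every name in taxa."""
    if not isinstance(tree, str):
        for child in tree:
            if taxa <= leaves(child):
                return smallest_clade(child, taxa)
    return tree


def is_valid_outgroup(tree, outgroup, ingroup):
    """True if the outgroup lies outside the smallest clade holding the ingroup."""
    return outgroup not in leaves(smallest_clade(tree, set(ingroup)))


# Assumed, simplified reference tree (hypothetical toy data, not the paper's tree).
reference_tree = (("Chlamydomonas", "land plants"), ("animals", "fungi"))

# The ingroup of plants, animals, and fungi spans the whole tree, so
# Chlamydomonas cannot sit outside it.
chlamy_ok = is_valid_outgroup(
    reference_tree, "Chlamydomonas", ["land plants", "animals", "fungi"]
)

# By contrast, animals would be a legitimate outgroup for the green lineage.
animals_ok = is_valid_outgroup(
    reference_tree, "animals", ["Chlamydomonas", "land plants"]
)

print(chlamy_ok, animals_ok)
```

On this toy tree the check fails for Chlamydomonas because the smallest clade containing land plants, animals, and fungi is the entire tree, which necessarily includes the green alga.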

Unfortunately, that claim is at odds with a massive body of literature on eukaryotic relationships, which is nearly unanimous in concluding that green algae, including Chlamydomonas, are more closely related to plants than either is to animals or fungi. Here, for example, is what has to be one of the most reproduced figures in evolutionary biology:

Baldauf Fig. 1

Figure 1 from Baldauf 2003. A consensus phylogeny of eukaryotes. The tree shown is based on a consensus of molecular and ultrastructural data.

Here’s another one, from Patrick Keeling and colleagues:

Keeling et al. 2005 Fig. 1

Figure 1 from Keeling et al. 2005. A tree of eukaryotes. The tree is a hypothesis composed from various types of data, including molecular phylogenies and other molecular characters, as well as morphological and biochemical evidence.

Both show green algae as sister to land plants + charophytes. Here’s a more recent example:

Figure 1 from Sharpe et al. 2015. Phylogenetic tree of eukaryotes based on a phylogenomic dataset. Numbers indicate bootstrap support (BS) for splits estimated from 500 pseudoreplicates. Split with BS < 100% are shown (all others are 100%).

Once again, green algae are in a clade (Chloroplastida) that is distantly related to animals (Metazoa) and fungi. I could keep at this all day, but rather than post dozens of trees I’m going to beg your indulgence in what might be criticized as an argument from authority: I have looked at a lot of phylogenetic trees of eukaryotes. A lot. I’m not aware of any (other than that of Jayaswal et al.) that contradict a sister group relationship between green algae and plants (including Charophytes). If anyone is aware of such a phylogeny, please say so in the comments. What I’m trying to say is that this is not a controversial topic. It is about as close to firmly established truth as it’s possible to get in the study of evolutionary relationships. Keeling et al., for example, say “the sister relationship of plants and green algae is beyond doubt.”

To recap, choosing Chlamydomonas as the outgroup amounts to a claim that plants are more closely related to animals and fungi than to green algae. If green algae are sister to plants, which is so well supported that it might fairly be described as a fact, this cannot be true.

Why, then, did Jayaswal and colleagues choose a green alga as the outgroup for their analyses? I don’t have to speculate, because they told me. In two separate comments, the lead author and the corresponding author defended their choice. First, from the corresponding author, Nagendra Kumar Singh:

There are always problems associated with the interpretation of the tree of life. We always get discrepancies no matter how well we optimise the parameters. We have reported the results in an unbiased way as we got it. Selection of C. reinhardtii is valid as also indicated in the original C. reinhardtii genome. However, we acknowledge some important points have been raised that will be addressed in due course of time, particularly we should have started all these extanct species at time zero for ease of viewing.

Some of this is in response to my criticisms of Jayaswal and colleagues’ interpretation of their tree, but I’m going to skip that for now and focus on the substantive. Dr. Singh’s rationale for choosing Chlamydomonas as an outgroup is, “Selection of C. reinhardtii is valid as also indicated in the original C. reinhardtii genome.”

I have a more than passing familiarity with the Chlamydomonas genome paper (Merchant et al.). Here’s what it says about phylogenetic relationships:

The Chlorophytes (green algae, including Chlamydomonas and Ostreococcus) diverged from the Streptophytes (land plants and their close relatives) over a billion years ago. These lineages are part of the green plant lineage (Viridiplantae), which previously diverged from opisthokonts (animals, fungi, and Choanozoa)

This directly contradicts the choice of C. reinhardtii as an outgroup. Chlamydomonas can’t be both “part of the green plant lineage” and its most distant relative.

The lead author, Pawan Kumar Jayaswal, went into more detail, and I’ll take that comment in parts:

Chlamydomonas reinhardtii (Cri) is a single cell green algae which retain the common features of plant (chloroplast-based photosynthesis) and animal (eukaryotic flagella). Merchant et al. (2007) in Chlamydomonas genome sequencing project mentioned about the divergence of its lineage from land plants over billion of years ago. Similarly, Yoon et al. (2004) estimated the split of red and green algae occurred about 1500 Mya. Outgroup which we have selected based on the above information and selected species is distantly related with the in-group species.

How long ago Chlamydomonas diverged from land plants does not, by itself, tell us anything relevant. The relevant question is whether Chlamy diverged from land plants before land plants diverged from animals and fungi, which it did not. The Chlamy genome paper says it did not (“part of the green plant lineage”), and Yoon et al. does not include animals or fungi, so it can’t possibly inform this question.

We have included twenty animal, seven fungi and four protista species in our sample data. Concatenated multigene based Bayesian phylogenetic tree showed the two protista species intermediate between the animal and fungi similar result based on limited number of genes (EF-1a, actin, b-tubulin, and HSP70, and/or a-tubulin) reported by Steenkamp et al. in 2005 [nucleariid a group of amoeba appears as the closest sister taxon to fungi and choanoflagellates group with sister lineage of animal (Hedges et al. 2004)].

Here’s the Steenkamp tree:

Steenkamp et al. 2005 Fig. 1

Figure 1 from Steenkamp et al. 2005. Monophyly of animals, fungi and their protistan allies based on concatenated EAHβ protein sequences.

Chlamydomonas is sister to land plants. This is incompatible with its selection as outgroup. Here’s the tree from Hedges et al.:

Hedges et al. 2004 Fig. 2

Figure 2 from Hedges et al. 2004. A timescale of eukaryote evolution.

Again, Chlorophytan green algae (which include Chlamydomonas) are sister to the land plants. As with Baldauf, Keeling, Sharpe, and Steenkamp, this contradicts the choice of Chlamydomonas as an outgroup.

Our 98 gene based phylogenetic tree analysis clearly aligned with the published report and at the same time we have not claimed in our paper as Chlamydomonas reinhardtii is the origin, on the basis of above literature we selected Cri as a outgroup.

The above literature gives no basis for selecting Chlamydomonas as an outgroup. Every paper the authors have cited in support of this choice actually contradicts it. Like so many of the problems with their paper, their belief that the literature supports their choice of outgroup is based on misinterpretations of phylogenetic trees.

The problems with the divergence time estimates really are too numerous to list, but here are some highlights. In some cases, the same fossil calibration was used for different divergences (at least, that’s what their Table 3 indicates). For example, a calibration of 110 million years was used for the divergences between rice and banana (O. sativa and M. acuminata), between rice and corn (O. sativa and Z. mays), and between rice and purple false brome (O. sativa and B. distachyon). They can’t all have diverged 110 million years ago:

Figure 7 detail

Detail of Figure 7 from Jayaswal et al. 2017.

Worse, there are contradictions among their results. For example, their inferred divergence time between O. sativa and M. acuminata is 22.03 million years; between O. sativa and B. distachyon 61.69 million years. But according to their own tree, M. acuminata diverged from O. sativa before B. distachyon did. Both things can’t be true.
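That contradiction can be checked mechanically. The sketch below is an illustration under stated assumptions: the three-taxon tree is my simplification of the relevant part of their tree as described above, and the helper names are hypothetical. It flags any node that is nested inside another node but has nevertheless been assigned an older date.

```python
from itertools import permutations


def leaves(tree):
    """Return the set of tip names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    tips = set()
    for child in tree:
        tips |= leaves(child)
    return tips


def mrca(tree, taxa):
    """Return the smallest subtree whose tips include every name in taxa."""
    if not isinstance(tree, str):
        for child in tree:
            if taxa <= leaves(child):
                return mrca(child, taxa)
    return tree


def nesting_violations(tree, estimates):
    """Find (inner, outer) pairs where the nested node is dated older.

    estimates maps a pair of taxa to an estimated divergence time (MYA).
    A node nested inside another cannot be older than the node containing it.
    """
    problems = []
    for (inner, t_inner), (outer, t_outer) in permutations(estimates.items(), 2):
        inner_tips = leaves(mrca(tree, set(inner)))
        outer_tips = leaves(mrca(tree, set(outer)))
        if inner_tips < outer_tips and t_inner > t_outer:
            problems.append((inner, outer))
    return problems


# Assumed simplification: M. acuminata branches off before the
# O. sativa / B. distachyon split, as in the authors' own tree.
monocot_tree = (("O. sativa", "B. distachyon"), "M. acuminata")

# Divergence times (MYA) as reported in the paper's table.
reported = {
    ("O. sativa", "M. acuminata"): 22.03,
    ("O. sativa", "B. distachyon"): 61.69,
}

violations = nesting_violations(monocot_tree, reported)
print(violations)
```

The rice versus Brachypodium split sits inside the rice versus banana split, so it is reported as a violation: it is dated almost 40 million years older than the node that contains it.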

In other cases, very different divergence times are estimated for the same divergence. For example, Chlamydomonas is inferred to have diverged from rice 180 million years ago, from the moss Physcomitrella patens 516 million years ago, and from potato blight (Phytophthora infestans) 1118 million years ago. The problem is that, again according to their own tree, these are the same divergence: if Chlamydomonas is the outgroup, then all ingroup species diverged from Chlamydomonas at the same time by definition.

Similarly, the estimates of the divergence between plants and animals range from 110 million years (sorghum versus hydra) to 549 million years (Brachypodium versus mouse) (there is also an estimate of zero–Brachypodium versus Hydra–but I assume this is a typo). Aside from being at odds with the fossil record, that’s a nearly five-fold range of estimates for the same divergence. If there were no other problems with this paper, the inference of a mid-Cretaceous divergence between animals and plants should have raised questions about the reliability of the results.
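The arithmetic behind these two examples is simple enough to spell out. This is only a sketch: the numbers are the ones quoted above from the paper's table, and the function name is hypothetical. Estimates that refer to the same node should agree, so the ratio of the largest to the smallest is a rough measure of how inconsistent they are.

```python
def fold_range(estimates_mya):
    """Ratio of the oldest to the youngest estimate for a single node."""
    values = list(estimates_mya.values())
    return max(values) / min(values)


# Estimates (MYA) that, on the authors' own tree, all describe the root node.
chlamy_vs_ingroup = {
    "O. sativa": 180,
    "P. patens": 516,
    "P. infestans": 1118,
}

# Estimates (MYA) that all describe the plant/animal divergence.
plant_vs_animal = {
    "Sorghum vs. Hydra": 110,
    "Brachypodium vs. Mus": 549,
}

for label, estimates in [
    ("Chlamydomonas vs. ingroup", chlamy_vs_ingroup),
    ("plants vs. animals", plant_vs_animal),
]:
    print(f"{label}: {fold_range(estimates):.2f}-fold range")
```

A single branching event has a single date, so any ratio much above 1 for the same node means the estimates cannot all be right.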

Jayaswal et al. Fig. 7

Figure 7 from Jayaswal et al. 2017. The blue circle (which I added) represents the divergence between Sorghum and Hydra, estimated at 110 MYA. It also represents the divergence between Brachypodium and Mus, estimated at 549 MYA.

There are several possible explanations for these problems making it through peer review and into a published paper. One is that the reviewers did not identify the problems. If so, this is a massive failure on the part of the peer reviewers, the handling editor, or both. If the handling editor was unable to identify qualified reviewers, that would explain the failure to detect the problem with outgroup choice. Without delving into the literature, a reviewer would have to have at least a passing familiarity with eukaryote relationships to know that Chlamydomonas is not a suitable outgroup. Even a reviewer with no background in phylogenetics, though, should have noticed the self-contradictions in divergence time estimates.

Another possibility is that the problems were identified, but the authors argued against them and convinced the handling editor that they were baseless. If this turns out to be the case, the editor has some ‘splaining to do. It would not be difficult to assess the claim that Chlamydomonas is an unsuitable outgroup, even less so to see that there are huge inconsistencies among the divergence time estimates for the same node.

At some stage, the peer review process failed these authors. The problems I’ve identified here and in my previous post could have been fixed, and they should have been fixed as a condition of publication. The best-case scenario here is that the handling editor chose reviewers who were not qualified to evaluate phylogenetic and molecular clock analyses. Even that scenario leaves some blame for the reviewers; any biologist (any educated human, really) should have recognized the absurdity of a mid-Cretaceous divergence between plants and animals.

What should be done? The problems with this paper seem too big and too pervasive for a correction. The phylogenetic analysis would have to be redone with an appropriate choice of outgroup, the divergence time estimates redone with more rigorous methods, and large portions of the paper would have to be rewritten to remove faulty interpretations of the phylogenetic tree.

At this point, the paper is still on the PLoS ONE website in its original published form. No editorial expression of concern has been posted, and the only comment is a reference to my earlier blog post (I didn’t post the comment). Since the journal has been aware of the problems for over three months, the lack of timely corrective action is a failure in itself.

Stable links:

Baldauf, S.L. 2003. The deep roots of eukaryotes. Science, 300: 1703–1706. doi: 10.1126/science.1085544

Jayaswal, P.K., Dogra, V., Shanker, A., Sharma, T.R. and Singh, N.K. 2017. A tree of life based on ninety-eight expressed genes conserved across diverse eukaryotic species. PLoS One, 12: e0184276. doi: 10.1371/journal.pone.0184276

Keeling, P.J., Burger, G., Durnford, D.G., Lang, B.F., Lee, R.W., Pearlman, R.E., et al. 2005. The tree of eukaryotes. Trends Ecol. Evol., 20: 670–676. doi: 10.1016/j.tree.2005.09.005

Merchant, S.S., Prochnik, S.E., Vallon, O., Harris, E.H., Karpowicz, S.J., Witman, G.B., et al. 2007. The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science, 318: 245–250. doi: 10.1126/science.1143609

Sharpe, S.C., Eme, L., Brown, M.W. and Roger, A.J. 2015. Timing the origins of multicellular eukaryotes through phylogenomics and relaxed molecular clock analyses. In: Evolutionary Transitions to Multicellular Life (I. Ruiz-Trillo and A. M. Nedelcu, eds). Springer. doi: 10.1007/978-94-017-9642-2

Comments

  1. another stewart says

    I think you’re misreading the calibrations. The dates aren’t the (calibrated) dates of divergence of pairs of taxa, but the dates used to calibrate the divergence date of those pairs of taxa. For example they state in the text that 110mya is the date of origin of (crown?) angiosperms. (They may actually be using it for monocots, ‘cos they’ve got a 200mya date that they seem to be using for angiosperms.) Naively you can then estimate the age of divergence of a pair of taxa by the ratio of the divergence between their genomes, and the divergence between them and a representative taxon from the sister group.

    Even so, the table of divergences is a mess – they got a date of 35.84 mya for the divergence between Pinus and Musa, which fails the laugh test, especially as it’s less than the divergence between two species of Pinus. (I’m hoping that’s a misprint; if not that’s a red flag there’s something wrong with their methodology.) Similarly for the date of 46.65 mya for the divergence between Triticum and Arabidopsis.

    In the light of the quoted comments in your post of today I’m wondering whether most common ancestor means something like a non-tree thinkers equivalent to most plesiomorphic ancestor. (Which would represent another questionable assumption – that morphologically conservative taxa are necessarily genetically conservative.)

    • Matthew Herron says

      You may be right; it’s difficult to tell exactly what they did from the Methods. They never say “We calibrated the divergence between x and y at z MYA,” leaving us to wonder exactly what these dates represent.
      If they’re using 110 MYA for the origin of monocots, though, shouldn’t Oryza vs. Musa be 110 MYA, not 22 MYA as they inferred? Similarly, if 110 MYA is for the origin of angiosperms, shouldn’t Triticum vs. Arabidopsis be 110 MYA rather than the 47 MYA they inferred?

      • another stewart says

        The Musa-Oryza divergence is 6 nodes in (on the tree presented at Wikipedia) from the root of crown angiosperms, but yes 22 mya seems rather short for Musa-Oryza. The Wikipedia article puts dates on the nodes, and has 131 mya on crown monocots (LCA of Acorus and all other monocots) and 118 mya on commelinids (which is the node joining inter alia grasses and bananas), so one would expect a 110 mya calibration to give a Musa-Oryza divergence of around 100 mya.

        And Triticum-Arabidopsis should be at least 110 mya. There’s something gone badly wrong with the calculation of divergence dates and I don’t think it’s just an artefact of sticking the root in the wrong place.

        Their choice of taxa struck me as a trifle odd – 6 grasses and 4 hominoids seems a waste of effort when evaluating grand eukaryote phylogeny, but perhaps they intended to investigate the estimation of divergence times over a range of timescales, so they needed some closely related species. The choice of 46 taxa from Opisthokonta and Viridiplantae among their 49 taxa is also odd.

        I’ve looked at their tree, and the topology is “wrong” in at least the following points: Solanum should be the outgroup among dicots (Arabidopsis – a malvid – should be sister to the other 5 dicots which are all fabids); humans are the outgroup among hominoids, rather than Pongo; Mus should be closer to hominoids than to Bos; and Dictyostelium should be sister to Opisthokonta rather than Fungi. I don’t trust all nodes on consensus phylogenies, but these divergences are surprising ones, especially for a 98-gene dataset. They are points that I would have asked to be covered in the discussion.

  2. says

    Thanks for spending so much time analysing our results and giving your critical comments. We are in touch with the PLOS One editorial office to address the concerns raised by you.

    I have personally asked my student to develop an unrooted phylogenetic tree of the 49 species and also a tree rooted by a prokaryote bacterium and then we can write a corrigendum for this section. We will also be developing a chronogram aligning all the taxa to time zero; the present one is a phylogram and the time scale bar should not have been there. In any case that time scale was not used for working out divergence time in our paper.

    The divergence times given in the Table are actually based on the synonymous substitution rates and not on the phylogram. The calibration times for the molecular clock are based on the earliest fossil records for which references are cited in the published Table. We cannot do much there and as I have also pointed out there are anomalies in the divergence times estimated by this method. We will provide a Table showing divergence times based on the new chronogram, which should have no internal contradiction although the divergence times may still vary considerably.

    The paper has much more information than the phylogenetic tree and the divergence time. This includes conservation of genes across species between plants and animals and evolutionary relationships among the 98 conserved genes.

    • Matthew Herron says

      Those would all be welcome improvements. I hope you can find a prokaryotic outgroup that has most of the 98 genes you found conserved across eukaryotes. If I may make a small suggestion, it would be really helpful for the divergence time table to indicate exactly which divergence each fossil calibration corresponds to.

Trackbacks

  1. […] most read science post was one related to the PLoS One debacle back in September, 2017: “Some responses to ‘A cautionary tale on reading phylogenetic trees’”. Two of the authors of the paper I was criticizing had responded in the comments to the […]
