Wednesday, February 3, 2010

Going Rogue

If we've learned anything from US politics over the past year and a half, it's that adding 'rogues' to an otherwise orderly system can result in rapid descent to the lowest common denominator. In the latest issue of Systematic Biology, Thomson and Shaffer show that the same is true in phylogenetics. With the goal of reconstructing a robust phylogenetic hypothesis for turtles using existing sequence data, Thomson and Shaffer use a new pipeline and the data available via the PhyLoTA browser to assemble a dataset that is noteworthy for both its size and incompleteness: 223 taxa, 53,406 bp, 7.59% complete. Although Thomson and Shaffer explore the influence of a variety of factors on phylogenetic inference, they conclude that rogue taxa "probably represent the most insidious problem for supermatrix phylogenetics." For those unfamiliar with rogue taxa, the term is used to describe taxa whose phylogenetic position can vary dramatically without having a strong effect on a tree's overall score. Thomson and Shaffer identify rogues using the taxonomic instability index (I) calculated by Mesquite from a sample of trees generated using standard bootstrapping methods. To explore the influence of rogues on phylogenetic resolution, Thomson and Shaffer remove those taxa exhibiting the most roguish behavior and redo their analyses. The top panel of their Fig. 4 provides a compelling visual representation of the results of this exercise, with the completely unresolved tree on the left being generated before pruning rogues and the nearly fully resolved tree on the right resulting from analysis of the same dataset subsequent to de-roguing. Although one might challenge the wisdom of deleting problematic taxa until a resolved tree is produced, this practice may be justified in some instances.
How we deal with rogue taxa is sure to be a topic of debate in the years to come, but, for the time being, Thomson and Shaffer's analyses suggest that simply deleting the taxa with the worst I values may be a reasonable solution.
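
For readers who want a feel for how an instability score can flag a rogue, here is a minimal Python sketch. It is not Mesquite's code and not Thomson and Shaffer's pipeline. It assumes trees are written as nested tuples of taxon names, measures the distance between two taxa as the number of edges separating them, and, for each taxon, sums |d1 - d2| / (d1 + d2)^2 over every other taxon and every pair of trees, which is my understanding of the general form of the index. The function names and the toy trees are hypothetical.

```python
from itertools import combinations


def leaf_paths(tree, prefix=()):
    """Map each taxon name to its path of child indices from the root."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def pairwise_distances(tree):
    """Edge-count distance between every pair of taxa in one tree."""
    paths = leaf_paths(tree)
    dist = {}
    for a, b in combinations(sorted(paths), 2):
        pa, pb = paths[a], paths[b]
        shared = 0
        for x, y in zip(pa, pb):
            if x != y:
                break
            shared += 1
        # Edges from each taxon up to their most recent common ancestor.
        dist[(a, b)] = len(pa) + len(pb) - 2 * shared
    return dist


def instability(trees):
    """Per-taxon instability score summed over all pairs of trees.

    Assumes every tree contains the same set of taxa.
    """
    dists = [pairwise_distances(t) for t in trees]
    scores = {taxon: 0.0 for taxon in leaf_paths(trees[0])}
    for dx, dy in combinations(dists, 2):
        for (a, b), d1 in dx.items():
            d2 = dy[(a, b)]
            term = abs(d1 - d2) / (d1 + d2) ** 2
            scores[a] += term
            scores[b] += term
    return scores


# Toy "bootstrap sample": taxon R jumps around while the rest stay put.
trees = [
    ((("A", "B"), ("C", "D")), "R"),
    ((("A", "R"), "B"), ("C", "D")),
    (("A", "B"), (("C", "R"), "D")),
]
scores = instability(trees)
for taxon in sorted(scores, key=scores.get, reverse=True):
    print(taxon, round(scores[taxon], 4))
```

In this toy sample the wandering taxon R comes out on top of the ranking. With a real bootstrap sample you would read the trees from file with a proper phylogenetics library, and the choice of how many top-ranked taxa to prune would remain a judgment call.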

7 comments:

220mya said...

I think the Inoue et al. paper in the same issue is equally interesting, if not more so! Hopefully you will cover that in a future blog post? Or is it too scary for many molecular phylogeneticists...

Inoue, J., P.C.J. Donoghue, and Z. Yang. 2010. The impact of the representation of fossil calibrations on Bayesian estimation of species divergence times. Systematic Biology 59(1): 74-89.

Tom Near said...

What will be interesting to determine is how much of the instability is a result of the incompleteness of the data matrix. What would be the relationship between the fraction of rogue taxa and the completeness of the matrix?

Glor said...

I think the question about the relationship between rogue taxa and incompleteness is going to require another dataset. The Thomson and Shaffer one is so incomplete that it's not well-suited to manipulating the level of completeness, as other recent studies of incomplete datasets have done.

Glor said...

@220mya
Our lack of a post on Inoue et al. is due more to our need to fulfill other (non-blog) obligations than the fact that their results have scary implications. If there are things about the Inoue et al. paper that you want to share with readers of dechronization, we'd be happy to consider a guest post.

someone said...

May I suggest the cetacean supermatrix of McGowen et al., 2009. MPE, 53: 891-906. 91 taxa; ~42,000 bp (not sure of the percentage of missing data, but probably not as bad as the turtles). I also think that having one gene with data for every taxon can help a lot with rogue taxa--I know this probably isn't an option for every group.

220mya said...

Heehee, I was just giving you folks a hard time. This blog has some of the best posts on phylogenetic methods! Alas, I'm afraid I don't have a firm enough grounding in Bayesian theory to produce a proper post. However, I think the key point (as many have said) is that you really need to understand what data are going into your models, and we need more evidence from sensitivity analyses. The time of producing back-of-the-napkin molecular age estimates is over. People really need to explicitly justify in the paper the phylogenetic position of the fossils, the age uncertainty associated with these fossils, and how that information is used to generate the priors. The unfortunate thing is that many of these papers are never sent to paleontologist peer-reviewers, folks who could easily correct some of the scarier misappropriations of fossil and geologic data.

Susan Perkins said...

Good points, 220mya. When I taught a class in applied phylogenetics last semester and we covered fossil (and other) calibration, it was scary to see how much the whole thing can be like a game of "telephone" with dates morphing slightly or molecular clocks being extended from one species to the next with no explanation except an unsatisfying reference for something else. But perhaps that's all another post. Tom's probably the best to write that one up, as he's published some papers on it.