Tuesday, January 5, 2010

That Darn LBA

Ah...I remember the days. Being a young grad student, trying to wrap my head around systematic biology and finding myself immersed in, and sometimes confused by, the debates going on in the literature - and in seminar rooms - over parsimony vs. maximum likelihood (ML). One favored topic of discussion was susceptibility to long-branch attraction (LBA). In my case, I went so far as to organize a graduate seminar that involved reading lots of papers and dragging John Huelsenbeck and Mark Siddall up to Burlington in the dead of winter to try to set us straight. I don't know about the rest of my cohort, but I finished the semester thinking that the only reasonable solution was to do my best to sample enough taxa to disrupt LBA as much as possible.

And then, much of the controversy died down for a while. Part of this, I speculate, was due to the development of the theory behind Bayesian inference (BI) in phylogenetic analyses and its growing popularity. BI was thought to have an advantage over ML in that it could incorporate uncertainty over the "nuisance parameters" in an analysis.

A recent paper in PLoS One by Bryan Kolaczkowski and Joe Thornton, however, has LBA rearing its ugly head again. In this paper, Kolaczkowski and Thornton presented convincing data that BI is very susceptible to inconsistency and bias, particularly in cases of LBA (the "Felsenstein zone") - and that these problems are exacerbated when the amount of sequence data increases, with the posterior probability support values for incorrect clades converging to 1.0. Kolaczkowski and Thornton explored these effects with classical four-taxon trees, with real, known-to-be-problematic datasets (the troublesome Encephalitozoon), and with other datasets with prescripted heterotachy and other heterogeneous parameters in the evolutionary model. Importantly, they contend that "more sophisticated MCMC algorithms and more complex priors" cannot alleviate the bias that BI shows.

The blossoming field of phylogenomics, and the desire to incorporate larger and larger matrices into our systematic analyses, may thus lead us to produce well-supported but false trees if BI is used and our datasets contain instances of LBA - and really, whose don't? This was a good read with some very important implications. I'm anxious to hear what others think of it.


Unknown said...

The susceptibility of ML and Bayes to LBA has also not gone unnoticed by the developers of PhyloBayes. We've found that PhyloBayes, which accounts for some artifacts in large data sets, correctly accounts for LBA by demonstrating low support in areas of the tree where ML and Bayes actually gave high support.

The following paper may be of interest: MBE 2009 - Parasitism and mutualism in Wolbachia. PMID: 18974066

Jonathan Eisen said...

I wrote a bit about this here and also included a mini email interview with one of the authors

Susan Perkins said...

Very nice piece, Jonathan. Thanks for pointing that out.

Joe Felsenstein said...

I wonder. If the amount of data is large, Bayesian methods should result in a posterior distribution that is strongly concentrated around the ML tree. And the concentration should get stronger with more and more data. So Bayesian inference, when used to make a point estimate drawn from any reasonable place in the posterior, should be consistent.

So I would put my 25 cents on this dramatic result being an artifact. I have lots of arguments against being a Bayesian, but this isn't going to be one of them, not unless someone can explain to me why it would be expected to happen.

(I was shown an earlier version of this paper, and explained some of my doubts to the authors, but then I got too busy to follow up on it. Can one of you do my work for me and give me a simple explanation?)

Joe Thornton said...

Joe raises an interesting question. We directly address it in our paper in the last two paragraphs of the Results section and in Fig. 6.

Joe suggests that 1) the likelihood function over branch lengths for each topology becomes more peaked as sequence length grows, so 2) one would expect Bayesian inference (BI) to converge on the true tree as the amount of data analyzed grows. The first part of that sentence is true, but (perhaps counterintuitively) it isn't relevant to the second part, which turns out to be incorrect. The relative support for one tree over another in BI is determined by the ratio of the integrated likelihoods for the two topologies (modified by the priors). As the quantity of data grows, the likelihood function does become more peaked, but it does so for every topology, and the ratio of the integrated likelihoods (and therefore of the posterior probabilities) for different topologies actually grows more extreme. If there is systematic bias at short or moderate sequence lengths, it will become stronger as more data are added.
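
For readers who want the "ratio of the integrated likelihoods" written out, here is a rough sketch in symbols. The notation is an assumption introduced for illustration and is not taken from the paper: tau_i and tau_j are two topologies, b is the vector of branch lengths, D is the alignment, L is the likelihood, p(b | tau) is the branch-length prior, and P(tau) is the topology prior. Other model parameters are suppressed for brevity.

```latex
\frac{P(\tau_i \mid D)}{P(\tau_j \mid D)}
  = \frac{P(\tau_i)}{P(\tau_j)}
    \cdot
    \frac{\int L(\tau_i, \mathbf{b} \mid D)\, p(\mathbf{b} \mid \tau_i)\, d\mathbf{b}}
         {\int L(\tau_j, \mathbf{b} \mid D)\, p(\mathbf{b} \mid \tau_j)\, d\mathbf{b}}
```

ML, by contrast, compares only the maxima of L over b for each topology, which is why a broad likelihood surface can help a topology under BI without helping it under ML.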

The likelihood surfaces over branch lengths, which we numerically characterized in Fig. 6a, illustrate this behavior. When the data are generated on a star-tree with Felsenstein zone branch lengths, there is more volume under the likelihood surface of the LBA tree (red lines) than that of any other tree. Although the peaks on each surface have identical height, the surface for the LBA tree has more butte-like contours, whereas the others have steeper slopes and lower integrated likelihoods. Although the true topology/branch-length combination has a higher point likelihood than any other combination, the LBA tree has a higher likelihood given most other (untrue) branch lengths than the other trees do. As Fig. 6b shows, this is because the convergent state patterns that arise frequently on non-sister long-branches support the LBA tree when incorrect branch lengths are assumed. Thus ML correctly infers equal support for all topologies, whereas BI prefers the LBA tree overall.

Now compare the upper to the lower plots in Fig. 6a. As the amount of sequence data grows, the likelihood function indeed scales so that the contours of each surface become more extreme -- the highs get higher and the lows get lower -- but their basic shapes do not change. Although all the peaks are skinnier, the LBA topology continues to have a more spread-out likelihood surface with higher volume under it than the other topologies do. In fact, the ratio of the volumes under the curves actually becomes more extreme, and support for the LBA tree increases (see Fig. 6c).

The analytical work described in the last two paragraphs of the Results section shows that this behavior is in fact theoretically expected. The integrated likelihood ratio of the LBA tree versus the other topologies will scale linearly with the number of sequence sites analyzed. Because the bias is caused by integrating over branch lengths -- a fundamental feature of fully Bayesian approaches -- it represents an intrinsic problem, not an implementation flaw.

We did this work and included it in the paper because Joe raised this same comment to us after reading an earlier version. We were grateful for Joe's comment because the work it stimulated showed decisively that the observation of increasing bias with increasing sequence length is a necessity, not an artifact.

Joe Felsenstein said...

OK, I have been looking over the paper. I can see how there can be long branch attraction when the prior makes the lengths of the two long branches improbable -- it could prefer to support a tree that had them partly joined, and thus they could each be shorter and thus closer to what the prior expects.

What is less obvious is why this LBA does not go away as longer and longer sequences are simulated. I can see that it might not if the true tree was a star tree, as in Figure 6, which Joe Thornton discussed in the previous comment.

But I would think that in a true tree that is resolved, the expected log-likelihood curve would be higher for the true topology, and with lots of sites the expected difference in log-likelihood would get big, and that would cause the posterior distribution to be more and more concentrated within the true topology. However in Figure 1 (say row a) I don't see that happening.

Joe, am I missing somewhere where you explain this?

Vladimir Minin said...

Joe F, you are not missing anything. The authors have made an integration mistake, pointed out by Mike Steel. Joe Thornton has published a correction.

My random thoughts on the matter are here.

Glor said...

Thanks for your insight Quantifier.

Vladimir Minin said...

I would like to suggest creating a new post about Kolaczkowski and Thornton's correction.

I think it is only fair if the correction receives as much attention as the original paper.

The link to my blog entry is on the front page twitter updates and is quite visible. Thank you for that! But I am worried that most people use Google Reader and never see twitter updates.

Joe Thornton said...

It turns out that Joe F. is correct about the asymptotic consistency of BI when the true tree is resolved. Mike Steel graciously helped us to identify an error in our analysis that affects our discussion of the asymptotic performance of BI on resolved trees, shown in parts of Fig. 6 and the last two paragraphs of the Results section. The paper's other findings are not affected.

Here are the results of the correction: On resolved trees in the Felsenstein zone (FZ), the strength of BI's long-branch attraction bias decreases in intensity as the amount of sequence data increases. For any given set of branch lengths in the FZ, it appears that there will be some critical sequence length below which BI will erroneously infer the LBA tree as the MAP tree; above this critical number of sites, BI will infer the true tree. For some conditions we studied, this critical length was quite large. Due to this bias, support for the true tree at any finite sequence length will be lower when using BI than when ML is used. When the true tree is unresolved, we observed no reduction in the strength of the bias as sequence length increases.
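
The idea of a critical sequence length can be illustrated with a deliberately crude toy, sketched below. It is not the model or the analysis from the paper. It assumes a single branch length b, a flat prior on [0, 2], and quadratic per-site log-likelihood curves in which the true tree has a slightly higher but much narrower peak than the LBA tree. Every name and numeric value in it is a hypothetical chosen for illustration.

```python
import math

# Toy per-site log-likelihood over ONE branch length b (illustrative
# assumptions only, not estimates from any real or simulated alignment):
#   loglik(b) = peak - curvature * (b - B_BEST) ** 2
TRUE_TREE = {"peak": 0.000, "curvature": 50.0}  # higher peak, narrow surface
LBA_TREE = {"peak": -0.002, "curvature": 2.0}   # lower peak, broad surface
B_BEST, B_MAX, STEPS = 1.0, 2.0, 4000           # flat prior on [0, B_MAX]


def log_marginal(tree, n_sites):
    """Log of the likelihood integrated over b under the flat prior."""
    h = B_MAX / STEPS
    logs = [
        n_sites * (tree["peak"] - tree["curvature"] * (i * h - B_BEST) ** 2)
        for i in range(STEPS + 1)
    ]
    top = max(logs)  # shift by the maximum to avoid underflow in exp()
    weights = [math.exp(v - top) for v in logs]
    area = h * (sum(weights) - 0.5 * (weights[0] + weights[-1]))  # trapezoid
    return top + math.log(area) - math.log(B_MAX)


def log_posterior_odds(n_sites):
    """Log posterior odds, true tree vs. LBA tree, equal topology priors."""
    return log_marginal(TRUE_TREE, n_sites) - log_marginal(LBA_TREE, n_sites)


def log_ml_odds(n_sites):
    """Log ratio of maximized likelihoods, which is what ML compares."""
    return n_sites * (TRUE_TREE["peak"] - LBA_TREE["peak"])


def critical_length(lo=1, hi=5000):
    """Smallest n_sites in (lo, hi] at which the toy BI favors the true tree.

    Assumes the odds are negative at lo and positive at hi.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if log_posterior_odds(mid) > 0:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    for n in (100, 500, 1000, 5000):
        print(n, round(log_ml_odds(n), 3), round(log_posterior_odds(n), 3))
    print("critical length:", critical_length())
```

In this toy the maximized likelihood favors the true tree at every sequence length, while the integrated likelihood favors the broader LBA surface until the per-site advantage of the true tree, which accumulates linearly with the number of sites, outweighs the fixed volume advantage. With these assumed numbers the crossover falls at roughly 800 sites, and a smaller peak difference or a larger difference in curvature pushes it higher.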

The overall picture is as follows. BI is subject to LBA bias caused by integrating over branch lengths. This bias reduces the accuracy and efficiency of BI on both simulated and empirical data, and it makes BI more susceptible than ML to error due to model violation. The bias increases in severity when more complex models are required. Its effect can be ameliorated in part by adding more data; under some challenging conditions, large quantities of data are required to overcome it. Under most conditions, BI and ML are likely to infer the same phylogenies; when they do not, LBA bias in BI is a potential cause.

We posted the correction online at PLoS as a comment with a revised figure, equation, and text. It is being processed by the journal and will be published/indexed shortly as a formal correction. Discovering an error is no fun, but a nice thing about publishing in an online open access journal is that we were able to correct the scholarly record and inform the scientific community about it so quickly.