Comments on dechronization: That Darn LBA

It turns out that Joe F. is correct about the asym...

2010-02-25T19:09:21.644-05:00

It turns out that Joe F. is correct about the asymptotic consistency of BI when the true tree is resolved. Mike Steel graciously helped us to identify an error in our analysis that affects our discussion of the asymptotic performance of BI on resolved trees, shown in parts of Fig. 6 and the last two paragraphs of the results section. The paper's other findings are not affected.

Here are the results of the correction: On resolved trees in the Felsenstein zone (FZ), the strength of BI's long-branch attraction bias decreases in intensity as the amount of sequence data increases. For any given set of branch lengths in the FZ, it appears that there will be some critical sequence length below which BI will erroneously infer the LBA tree as the MAP tree; above this critical number of sites, BI will infer the true tree. For some conditions we studied, this critical length was quite large. Due to this bias, support for the true tree at any finite sequence length will be lower when using BI than when ML is used. When the true tree is unresolved, we observed no reduction in the strength of the bias as sequence length increases.
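The critical-length behavior described above can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the analysis from the paper or its correction: it assumes the log posterior odds of the true tree over the LBA tree behave roughly like `n * delta - volume_advantage`, where `delta` is a hypothetical expected per-site log-likelihood advantage of the true topology (zero for an unresolved tree) and `volume_advantage` is a hypothetical fixed penalty standing in for the bias from integrating over branch lengths. All names and numbers are made up for illustration.

```python
import math


def log_posterior_odds(n_sites, delta, volume_advantage):
    """Toy log posterior odds of the true tree over the LBA tree.

    Assumption (not from the paper): the odds grow linearly in the number
    of sites and are offset by a constant volume_advantage for the LBA tree.
    """
    return n_sites * delta - volume_advantage


def critical_length(delta, volume_advantage):
    """Smallest whole number of sites at which the toy odds become positive.

    Returns None when delta <= 0 (e.g. an unresolved true tree), because
    the toy odds then never cross zero.
    """
    if delta <= 0:
        return None
    return math.floor(volume_advantage / delta) + 1


# Hypothetical values chosen only to show the crossover.
n_star = critical_length(delta=0.0007, volume_advantage=3.0)
print("toy critical length:", n_star)
print("odds just below:", log_posterior_odds(n_star - 1, 0.0007, 3.0))
print("odds at critical length:", log_posterior_odds(n_star, 0.0007, 3.0))
print("unresolved tree:", critical_length(delta=0.0, volume_advantage=3.0))
```

In this toy model a smaller `delta` (a harder Felsenstein-zone problem) pushes the critical length up, which is one way to picture why it can be quite large under some conditions.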

The overall picture is as follows. BI is subject to LBA bias caused by integrating over branch lengths. This bias reduces the accuracy and efficiency of BI on both simulated and empirical data, and it makes BI more susceptible than ML to error due to model violation. The bias increases in severity when more complex models are required. Its effect can be ameliorated in part by adding more data; under some challenging conditions, large quantities of data are required to overcome it. Under most conditions, BI and ML are likely to infer the same phylogenies; when they do not, LBA bias in BI is a potential cause.

We posted the correction online at PLoS as a comment with a revised figure, equation, and text. It is being processed by the journal and will be published/indexed shortly as a formal correction. Discovering an error is no fun, but a nice thing about publishing in an online open access journal is that we were able to correct the scholarly record and inform the scientific community about it so quickly.

I would like to make a suggestion of creating a ne...

2010-02-19T11:32:56.100-05:00

I would like to suggest creating a new post about Kolaczkowski and Thornton's correction.

I think it is only fair if the correction receives as much attention as the original paper.

The link to my blog entry is on the front page twitter updates and is quite visible. Thank you for that! But I am worried that most people use Google Reader and never see twitter updates.

Thanks for your insight Quantifier.

2010-02-17T23:25:21.170-05:00

Thanks for your insight Quantifier.

Joe F, you are not missing anything. The authors h...

2010-02-17T22:54:29.362-05:00

Joe F, you are not missing anything. The authors have made an integration mistake, pointed out by Mike Steel. Joe Thornton has published a correction.

My random thoughts on the matter are here.

OK, I have been looking over the paper. I can see...

2010-01-14T19:16:08.808-05:00

OK, I have been looking over the paper. I can see how there can be long branch attraction when the prior makes the lengths of the two long branches improbable -- it could prefer to support a tree that had them partly joined, and thus they could each be shorter and thus closer to what the prior expects.

What is less obvious is why this LBA does not go away as longer and longer sequences are simulated. I can see that it might not if the true tree was a star tree, as in Figure 6, which Joe Thornton discusses in the previous comment.

But I would think that in a true tree that is resolved, the expected log-likelihood curve would be higher for the true topology, and with lots of sites the expected difference in log-likelihood would get big, and that would cause the posterior distribution to be more and more concentrated within the true topology. However in Figure 1 (say row a) I don't see that happening.

Joe, am I missing somewhere where you explain this?

Joe raises an interesting question. We directly ad...

2010-01-12T14:06:42.496-05:00

Joe raises an interesting question. We directly address it in our paper in the last two paragraphs of the Results section and in Fig. 6.

Joe suggests that 1) the likelihood function over branch lengths for each topology becomes more peaked as sequence length grows, so 2) one would expect Bayesian inference (BI) to converge on the true tree as the amount of data analyzed grows. The first part of that sentence is true, but (perhaps counterintuitively) it isn't relevant to the second part, which turns out to be incorrect. The relative support for one tree over another in BI is determined by the ratio of the integrated likelihoods for the two topologies (modified by the priors). As the quantity of data grows, the likelihood function does become more peaked, but it does so for every topology, and the ratio of the integrated likelihoods (and therefore of the posterior probabilities) for different topologies actually grows more extreme. If there is systematic bias at short or moderate sequence lengths, it will become stronger as more data are added.
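For reference, the quantity being compared can be written out. This is the standard form of the posterior odds for two topologies, not an equation taken from the paper; the symbols (topologies \tau_1 and \tau_2, data D, branch-length vector \mathbf{b}) are notation assumed here for illustration.

```latex
\frac{P(\tau_1 \mid D)}{P(\tau_2 \mid D)}
  = \frac{P(\tau_1)}{P(\tau_2)}
    \cdot
    \frac{\int L(D \mid \tau_1, \mathbf{b})\, p(\mathbf{b} \mid \tau_1)\, d\mathbf{b}}
         {\int L(D \mid \tau_2, \mathbf{b})\, p(\mathbf{b} \mid \tau_2)\, d\mathbf{b}}
```

ML, by contrast, compares only the maximized values \max_{\mathbf{b}} L(D \mid \tau_1, \mathbf{b}) and \max_{\mathbf{b}} L(D \mid \tau_2, \mathbf{b}), so the two approaches can disagree whenever the surfaces differ in shape away from their peaks.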

The likelihood surfaces over branch lengths, which we numerically characterized in Fig. 6a, illustrate this behavior. When the data are generated on a star-tree with Felsenstein zone branch lengths, there is more volume under the likelihood surface of the LBA tree (red lines) than that of any other tree. Although the peaks on each surface have identical height, the surface for the LBA tree has more butte-like contours, whereas the others have steeper slopes and lower integrated likelihoods. Although the true topology/branch-length combination has a higher point likelihood than any other combination, the LBA tree has a higher likelihood given most other (untrue) branch lengths than the other trees do. As Fig. 6b shows, this is because the convergent state patterns that arise frequently on non-sister long-branches support the LBA tree when incorrect branch lengths are assumed. Thus ML correctly infers equal support for all topologies, whereas BI prefers the LBA tree overall.
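The difference between peak height and volume can be seen in a small numerical toy. This is a sketch with made-up one-dimensional surfaces and a flat prior, not the likelihood surfaces from Fig. 6a: `peaked` and `butte` are hypothetical curves that share the same maximum but differ in how much area lies under them.

```python
import numpy as np


def integrate(y, x):
    """Trapezoid-rule integral of y over x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)


# Hypothetical one-dimensional "branch length" axis with a flat prior on [0, 1].
b = np.linspace(0.0, 1.0, 20001)

# Two invented likelihood surfaces, both peaking at height 1.0 at b = 0.5.
peaked = np.exp(-((b - 0.5) / 0.05) ** 2)  # steep slopes on both sides
butte = np.exp(-((b - 0.5) / 0.20) ** 4)   # flat-topped, butte-like contour

# An ML-style comparison looks only at the peaks, which are tied here.
ml_ratio = butte.max() / peaked.max()

# A BI-style comparison integrates over the branch length (flat prior),
# so the flat-topped surface gets more support despite the tied peaks.
integrated_ratio = integrate(butte, b) / integrate(peaked, b)

print("ratio of peak heights:", ml_ratio)
print("ratio of integrated likelihoods:", integrated_ratio)
```

With these invented shapes the peak ratio is 1 while the integrated ratio is well above 1, which mirrors the contrast drawn above between ML's equal support and BI's preference for the surface with more volume.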

Now compare the upper to the lower plots in Fig. 6a. As the amount of sequence data grows, the likelihood function indeed scales so that the contours of each surface become more extreme -- the highs get higher and the lows get lower -- but their basic shapes do not change. Although all the peaks are skinnier, the LBA topology continues to have a more spread-out likelihood surface with higher volume under it than the other topologies do. In fact, the ratio of the volumes under the curves actually becomes more extreme, and support for the LBA tree increases (see Fig. 6c).

The analytical work described in the last two paragraphs of the Results section shows that this behavior is in fact theoretically expected. The integrated likelihood ratio of the LBA tree versus the other topologies will scale linearly with the number of sequence sites analyzed. Because the bias is caused by integrating over branch lengths -- a fundamental feature of fully Bayesian approaches -- it represents an intrinsic problem, not an implementation flaw.

We did this work and included it in the paper because Joe raised this same comment to us after reading an earlier version. We were grateful for Joe's comment because the work it stimulated showed decisively that the observation of increasing bias with increasing sequence length is a necessity, not an artifact.

I wonder. If the amount of data is large, Bayesi...

2010-01-11T07:35:50.611-05:00

I wonder. If the amount of data is large, Bayesian methods should result in a posterior distribution that is strongly concentrated around the ML tree. And the concentration should get stronger with more and more data. So Bayesian inference, when used to make a point estimate drawn from any reasonable place in the posterior, should be consistent.

So I would put my 25 cents on this dramatic result being an artifact. I have lots of arguments against being a Bayesian, but this isn't going to be one of them, not unless someone can explain to me why it would be expected to happen.

(I was shown an earlier version of this paper, and explained some of my doubts to the authors, but then I got too busy to follow up on it. Can one of you do my work for me and give me a simple explanation?)

Very nice piece, Jonathan. Thanks for pointing th...

2010-01-07T08:17:02.925-05:00

Very nice piece, Jonathan. Thanks for pointing that out.

I wrote a bit about this here and also included ...

2010-01-06T16:52:07.503-05:00

I wrote a bit about this here and also included a mini email interview with one of the authors.

The susceptibility of ML and Bayes to LBA has also...

2010-01-06T05:56:08.999-05:00

The susceptibility of ML and Bayes to LBA has also not gone unnoticed by the developers of PhyloBayes. We've found that PhyloBayes, which accounts for some artifacts in large data sets, correctly accounts for LBA by demonstrating low support in areas of the tree where ML and Bayes actually gave high support.

The following paper may be of interest: MBE 2009 - Parasitism and mutualism in Wolbachia. PMID: 18974066