Friday, April 3, 2009

When MrBayes Fails...

Last summer, Hackett et al. published a widely-read study of phylogenetic relationships among major bird lineages based on 19 independent loci sampled from 169 species (see also Tom Near's previous post). Their study confirmed some patterns suggested by previous phylogenetic studies (e.g., ratites + tinamous as sister to remaining bird species) while also recovering some novel patterns (e.g., passerines sister to parrots [albeit with low support]). One of the more interesting results from their analyses, however, was relegated to the on-line supplement. There, we learn that all eight of the 10-million-generation partitioned analyses they ran in MrBayes apparently failed to reach stationarity (see figure; note that the first 2 million generations are inexplicably trimmed from each analysis as 'burn-in'). Unpartitioned analyses fared even worse, resulting in immediate crashes "regardless of the memory capacity of the computers used."

Among the partitioned analyses, some continued to shift to new areas of the likelihood surface until relatively late in the analysis. Perhaps even more troubling, analyses that did appear to reach a stable plateau sampled significantly different likelihood scores (e.g., lnL ≈ −861,000 vs. ≈ −859,500, a gap of roughly 1,500 log-likelihood units). Is this problem unavoidable in analyses of large datasets?

The most obvious solution would be to simply run the analyses for more than 10 million generations; I've certainly had analyses that required more than 10 million generations to reach stationarity. Perhaps this wasn't done because it took two months on a supercomputer to run the 10-million-generation analyses (anybody know if Hackett et al. or others have implemented longer runs since their paper was published?). Another possible solution to their problems is to modify the parameters of the MC3 analyses implemented by MrBayes (recall that the MrBayes default is to run two independent MC3 analyses, each with one cold chain and three heated chains). Hackett et al. explored this possibility by running six analyses with one heated chain and one cold chain (B1-B6) and two analyses with six heated chains and one cold chain (A1-A2). The analyses run with multiple heated chains performed significantly better than those with a single heated chain, perhaps because MrBayes heats multiple chains incrementally (meaning that the fourth of six heated chains samples a flatter likelihood surface than the first). Hackett et al. do not discuss the temperatures used for the heated chains in their analyses, but their results suggest that running multiple heated chains in a single analysis is superior to repeatedly running analyses with only one heated chain.
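For reference, MrBayes heats chain i by raising the posterior to the power 1/(1 + (i − 1)T), where T is the Temp parameter and chain 1 is the cold chain. Here is a minimal sketch of what that scheme implies for the chain configurations Hackett et al. compared; the temperature T = 0.1 is the MrBayes default and is only an assumption here, since their actual setting isn't reported:

```python
# Minimal sketch of MrBayes-style incremental heating: chain i samples the
# posterior raised to beta_i = 1 / (1 + (i - 1) * T).  T = 0.1 is the MrBayes
# default and is assumed here; Hackett et al.'s setting is not reported.

def chain_powers(n_chains, temp=0.1):
    """Heating power (beta) for each chain; the first chain (beta = 1.0) is cold."""
    return [1.0 / (1.0 + i * temp) for i in range(n_chains)]

# 1 cold + 1 heated chain (their B runs), the MrBayes default of 1 + 3,
# and 1 cold + 6 heated chains (their A runs):
for n in (2, 4, 7):
    print(n, "chains:", ", ".join(f"{beta:.3f}" for beta in chain_powers(n)))
```

Under these assumptions, a lone heated chain still samples a surface close to the cold posterior (beta ≈ 0.91), whereas the sixth heated chain samples a much flatter one (beta ≈ 0.63), which may be part of why the multi-chain A runs mixed better.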

In any case, Hackett et al.'s ultimate solution was to discard all of their Bayesian analyses and rely instead on parsimony and the fast maximum likelihood methods implemented by GARLI and RAxML. Is this shift away from Bayesian inference in favor of fast maximum likelihood searches for computational reasons a sign of things to come (or has this shift already occurred)? Are the fast maximum likelihood methods ready for prime time, or do people remain uncomfortable with the shortcuts they use to achieve their apparent computational efficiency?

15 comments:

Todd Jackman said...

Also troubling for me with respect to Bayesian analyses is Ziheng Yang's 2008 paper:
Phil. Trans. R. Soc. B (2008) 363, 4031–4039
Radically different trees, each with PP of 1 on almost every node, result depending on the model selected for large data sets in whales.
Although he doesn't see the problems as general for Bayesian analysis, if really big data sets produce drastically different trees depending on the assumed model, and the only way to fix it is to arbitrarily fiddle with priors, then likelihood trees seem like a better alternative.

From the discussion:
"In both the small and large datasets, the data size-
dependent prior almost always led to reduced PPs for
the MAP trees. It thus appears useful in reducing the
apparent conflicts in Bayesian phylogenetic analysis.
The implementation here involves some arbitrariness."

Dan Rabosky said...

Interesting post, Rich; and Todd, thanks for pointing out the Yang paper - I missed that.

The issue with stationarity of likelihoods just scratches the surface of the "convergence" issue. I've analyzed datasets that very quickly reached apparent stationarity, and for which independent runs reached the same stable plateau. But in fact, within individual runs, shifts between islands in tree space occurred approximately every 30-50 million generations. This suggests that a valid sample from the posterior would probably require billions of generations.
This would be very difficult to detect without tools like AWTY, of which I am a huge fan.

As an aside, I haven't found TRACER's diagnostics as useful as AWTY's, because AWTY explicitly looks at the one parameter we care about - topology - whereas TRACER looks at all the molecular evolutionary parameters, and a valid sample of those does not necessarily imply a valid sample of trees.
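To make that concrete, here's roughly the kind of comparison AWTY's split-frequency plots are based on (a sketch, not AWTY's actual code; it assumes each post-burn-in tree has already been reduced to its set of splits, e.g., frozensets of taxon labels):

```python
# Sketch of a between-run topology check in the spirit of AWTY's "compare" plot
# (not AWTY's actual code).  Each run is a list of sampled trees, and each tree
# is represented as a set of splits; parsing tree files into splits is omitted.
from collections import Counter

def split_frequencies(trees):
    """Posterior frequency of each split across one run's post-burn-in sample."""
    counts = Counter(split for tree in trees for split in tree)
    return {split: n / len(trees) for split, n in counts.items()}

def worst_split_disagreement(run1, run2):
    """The split whose estimated posterior probability differs most between runs."""
    f1, f2 = split_frequencies(run1), split_frequencies(run2)
    worst = max(f1.keys() | f2.keys(),
                key=lambda s: abs(f1.get(s, 0.0) - f2.get(s, 0.0)))
    return worst, abs(f1.get(worst, 0.0) - f2.get(worst, 0.0))
```

If two runs that look flat in the likelihood trace still give a large worst-case disagreement here, they are not sampling the same posterior on topologies.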

Unknown said...

Hackett et al.'s solution was to revert back to parsimony??!?! That seems a bit archaic.

Susan Perkins said...

Oh my goodness - I think I might need to print out Barb's comment and post it to my door.

Good post, Rich - I feel like I'm constantly making comments about the number of generations run when I'm reviewing manuscripts, and clearly the issue of how long runs need to be, and whether true stationarity can realistically be reached, has not been adequately explored yet.

Unknown said...

I think people have shifted to fast ML methods for large datasets for practical reasons (it's simply impossible to run extremely large datasets in MrBayes). In fact, I think many people switched to Bayesian methods years ago for practical reasons as well (getting PPs was much faster than PAUP bootstraps), ignoring or adjusting to the philosophical differences between the approaches.

In my experience, RAxML especially is ready for wide use; I have found very good performance. But the larger question users should be asking is whether they are looking for the ML tree with some attempt at a confidence interval (using one technique or another), or for samples from the posterior distribution of trees. These are important questions, with real consequences, and they require users not to treat these methods as black boxes.

Interesting stuff.


Dan Warren said...

Back when we were first testing AWTY's predecessor (Converge) on a bunch of different data sets, we saw this sort of thing happen with posteriors in some of the larger data sets. The most egregious offender was only ~85 taxa, and we found things that looked like this going out to 40 million generations or more. Worse yet, even after the posteriors for the separate runs flattened out, we were left with substantial disagreement between runs on the PPs for many clades. We did ten separate runs of over 50 million generations each, and the results split perfectly into two sets of five runs, each of which was internally consistent but had substantial disagreement with all of the runs from the other set. That's about the worst news you could get out of MCMC, because it means that if you just did two runs there's a 50% chance you'd get results that no longer showed radically changing posteriors and that were consistent between runs. Most people would call that a good result, and yet the truth was that the chains were nowhere near convergence.

That was our poster child for misbehaving data, though, and we had other data sets with more species that were nowhere near that bad. Just to check out what was going on we did a bunch of random addition sequences followed by likelihood searches and found that there were just a buttload of local optima, and that the top two were not that different in likelihood. Clearly we had chains getting stuck on each of those optima for a long time.

I feel like I should be telling this around a campfire with a flashlight under my chin because it's basically the equivalent of a spoooooky story for Bayesian phylogenetics people.

Dan Warren said...
This comment has been removed by the author.
Matt said...

Nice to see this discussion. Another horror story; the perpetrators will remain anonymous. For some time, a ~14-gene analysis of several hundred taxa has resulted in 100% posterior probability for a (very) novel placement of a taxon of note. This result was reported at various meetings, etc. Turns out that, for the taxon in question, a single gene (1 of 14) was in fact a contaminant that wasn't caught. Now this might not have been a problem specific to Bayesian analyses, but it's definitely scary (suggesting a tiny amount of error can swamp an analysis).

blackrim- How are you measuring performance?

barb- Easy now, parsimony definitely has its place. Among other things, the fast ML methods would be doing nothing fast on large datasets without it (they generate a parsimony tree as a starting point). It's also the only method that can handle truly massive datasets; witness the 70k-terminal analysis by Goloboff's team presented at the last Cladistics meeting.

DanS said...

Shouldn't we expect to see this problem with large, multigene (= more parameters) datasets with star-trees?

Isn't it possible that the figure is showing different parameter estimates for a similar set of trees? Put another way - is it possible their analyses found the same topologies, but the large number of parameters complicates the analysis?

As far as the scary story goes - I thought it was widely recommended to do at least 4 independent runs for any dataset?

Dan Warren said...

It may be common practice now to do multiple independent runs, but it sure wasn't the norm back then.

Glor said...

Good stuff, y'all. Beware that Dr. Moore is preparing a response that is tentatively titled "Attention Phylo-whores"!

Rob Lanfear said...

I've worked a little bit with similar datasets and seen similar problems. It seems that MrBayes often doesn't find good topologies in a reasonable time for large (>100?) numbers of taxa, even with lots of chains, etc.

One way around this is to start the run with an ML topology (e.g., from RAxML or GARLI) with a small number of perturbations (e.g., nperts=5 in MrBayes). As long as the MCMC is behaving, this should still be a valid way to sample the posterior. If combining data from multiple runs, you'll need to leave enough burn-in to make sure that the start points are as independent as possible (you can see this happening by tracking the stdev of split frequencies, which starts off rising as runs diverge, then starts dropping again).
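Roughly, that diagnostic amounts to the following (a sketch of the average standard deviation of split frequencies that MrBayes reports across independent runs; the 0.10 minimum-frequency cutoff and other details here are assumptions, not necessarily MrBayes's exact implementation):

```python
# Sketch of the "average standard deviation of split frequencies" convergence
# diagnostic reported by MrBayes for multiple independent runs.  Details such
# as the 0.10 minimum-frequency cutoff are assumptions, not necessarily
# MrBayes's exact implementation.  Each run is a dict mapping split -> frequency,
# as produced by the split_frequencies() sketch earlier in this thread.
from statistics import pstdev

def average_sd_of_split_frequencies(run_freqs, min_freq=0.10):
    """Average across splits of the SD of each split's frequency across runs."""
    splits = {s for freqs in run_freqs for s, f in freqs.items() if f >= min_freq}
    sds = [pstdev([freqs.get(s, 0.0) for freqs in run_freqs]) for s in splits]
    return sum(sds) / len(sds) if sds else 0.0
```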

Any thoughts?

Salva said...

I agree with Stephen's (blackrim) comment: most people changed methods because of speed rather than the philosophy of each approach (beyond the "archaic" nature of the approach, :P)

BenRI said...

First, I sympathize with the suspicion about whether MrBayes is converging. Current Bayesian tree proposals are somewhat "stupid" in the sense that they are data-agnostic, and also may not update branch lengths to match proposed topologies. This is easier to do in an ML setting.

That said, the obvious question here is "If MrBayes fails, how do we know whether the other methods succeed?" Does anyone know if the ML or MP searches that the authors used succeeded? (e.g., did multiple independent runs give the same answer?)

Clearly, it is not great practice to say "We checked this method, and it failed, so we used another method which we didn't check." :-)

-BenRI

Glor said...

For those who remain interested in this thread, John Harshman (one of the authors of the Hackett et al. paper) has weighed in with some thoughts in the comments to another post that involved this paper.