As a friend asked yesterday, why aren’t more folks using AWTY to assess convergence? As far as I am aware, this is the only general diagnostic tool out there that is geared towards assessing convergence of *topologies*, rather than convergence of molecular evolutionary parameters. As such it, is explicitly addressing the one set of parameters that are (usually) of greatest interest to systematists: the tree itself.

With this in mind, I conducted an informal survey of convergence diagnostics in the literature. I looked at all articles published in two recent issues of Molecular Phylogenetics and Evolution using Bayesian inference (mostly MrBayes, but a few using BEAST) and tabulated the convergence diagnostics used. MPE seems like it should be a reasonable gauge of methods currently used by practicing systematists, although it would not surprise me if papers in some journals (Systematic Biology?) use convergence assessment that is, on average, more rigorous. Anyway, in 25 studies:

3 studies reported only that they “examined stationarity of LnL values” or something to this effect. I hesitate to say that this is the worst possible test of convergence, because 5 studies reported no test of convergence whatsoever and another tested for convergence by ‘discarding burn-in’. I think most readers of this blog would agree that these are generally not adequate.

The most frequent class involved some variation of analyzing multiple runs (11 studies); this includes checking the standard deviation of split frequencies for independent runs (6 studies) and comparing posterior probabilities for independent runs (2 studies). I think this is a good general strategy, but the majority of these considered only 2 independent runs. This is not good. Let’s imagine that treespace for your dataset contains two (rather different) topologies of high and equal probability. At convergence, your MCMC sampler should visit both of these topologies in proportion to their posterior probability (say, ~47% of the time for each, as no other topologies are nearly as good).

A major problem arises if it takes many generations to move between these topologies. Even for a highly-optimized MCMC sampler, it does not surprise me at all that it might take many millions of generations to move between “distant” regions of treespace, particularly for large datasets. If this is true, two runs is far from adequate, because there is a 50% chance that two independent runs will find the same high-probability topology first, and – if not run for a sufficient number of generations – it will appear as though the runs have converged, based on both similarity of posterior probabilities, standard dev of split frequencies, etc. This is something of a worst-case scenario, because – as I’ve described it – certain clades would appear to have ~1.00 posterior probability, when in fact the true posterior probability might be closer to 0.5. Unfortunately, there appears to be no good way of determining a ‘sufficient’ number of generations a priori, so the only solution here seems to be ‘lots of runs.’

The next most frequent strategy involved checking convergence of molecular evolutionary parameters, either by estimating effective sizes of parameters (6 studies), or by checking the Gelman-Rubin proportional scale reduction factor (1 study). I am skeptical of these approaches, considered alone, because I’ve found that there is often little correspondence between convergence of molecular evolutionary parameters and convergence of topologies. Perhaps this reflects some particularly troublesome datasets that I have worked with, but it does not leave me feeling encouraged. Note that I am not claiming that monitoring these parameters is unimportant, but that it is fundamentally inadequate with respect to our interest in topologies.

## 33 comments:

I'm with you on the gravity of these problems Dan. People don't run enough searches and don't do a good job assessing convergence for those runs they do complete. I think one of the main reasons why people don't use AWTY is because it is difficult to understand the meaning of the output. I've tried to get the word out with several previous posts ([1], 2, 3). We will have another post soon that will help people understand exactly what the "average standard deviation of split frequencies" (the convergence diagnostic referenced in the MrBayes manual and output when two independent MCMCMC analyses are run simultaneously) means.

I agree. We have to take into consideration that many of MrBayes users run analyses using default parameters, hence the high incidence of two simultaneous runs (Nruns=2 by default) and the reporting of the number of split frequencies and PRSF (because they're part of the output) rather in a mechanistic fashion. I am not referring to the last issues of MPE, but rather to a general feeling I got since MrBayes v2 was released.

It'd also nice to talk about autocorrelation and ESS. I'd also like to add to the list the exploration of heating schemes involving chain temperatures and number of chains. I usually have to tweak the default t=0.2 to lower values to get good mixing. While I'm on it, other heating schemes are implemented in Jody Hey's IM/IMa incl. linear, 2-step, 2-step adaptive, and geometric.

I think the problem may be that, as far as I know, there isn't a general paper explaining the deficiencies of lnL plots and SDs and why AWTY is the best way to go. I've tried to be an advocate of AWTY in my papers and in discussion with colleagues, but I wish I had a paper I could point to and say "read this."

From Dan R.'s review of the literature and my own experience, I think it is obvious that many researchers simply do not understand the importance of convergence. I suggest using AWTY in literally every paper I review (usually with a brief explanation of why looking at clade posterior probabilities, rather than other parameters, is the way to go). However, frequently the paper is published without any changes to the convergence diagnostics despite my suggestion (making me think both the author and editor may not think assessing convergence is a big deal).

I forgot to post a link to the AWTY citation: Bioinformatics (2008)

Matt- I agree. I think it is not immediately obvious that one consequence of convergence failure is, potentially, dramatically inflated posterior probabilities. Maybe there is a perception that failure to converge can give some cruddy parameter estimates, but that the big picture - the tree and clade frequencies - should not be especially affected?

The initial idea when we started out with AWTY (then Converge) was that there would be a big paper wherein we would re-analyze a bunch of data sets and show how badly you could mis-diagnose convergence using -lnL plots. A combination of factors acted to prevent that from happening, not the least of which being that I left for grad school. It's a little surprising that people are so cavalier about it still, so maybe a paper like that is still necessary. Anyone have a cluster they can spare for a year?

I've been using Tracer to look at MrBayes parameter outputs. What is the advantage of AWTY over Tracer? Is it the ability to compare clade probability values from independent runs?

I see 2 main advantages:

(1) Comparing posterior probabilities for independent runs. If you have true convergence, then you should be able to demonstrate that these are similar for many independent runs. This is quite different from looking at effective sample sizes for molecular evolutionary parameters (tracer).

(2) the ability to monitor convergence within a run using the sliding window / interval method with non-overlapping samples (the slide command). This tests whether subsamples of your chain are also sampling trees in proportion to their posterior probability. If you divide your chain into 5 or 10 subsamples, and compute clade frequencies, they should be similar if you have convergence. At finer scales, this of course breaks down, but at the coarsest scales, it should be a good guide. If you use 5-10 intervals and you see clade frequencies wobbling erratically, this is indicating that your subsamples of trees are not describing similar distributions.

Again, this is looking at the sampled trees rather than molecular evolutionary parameters, in contrast to tracer.

I think these are best used in concert. Using (2) alone is not OK, because a flat trajectory of posterior probabilities across a sample could be caused by the fact that the sampler is simply wandering around one particular island in tree space. Multiple runs seem to be the only way to evaluate whether this is occurring.

I think some folks are uncomfortable with the fact that these are more or less 'rules of thumb': how different do the posterior probs for runs have to be before you should be concerned, for example? I'd start to get worried if it exceeds 5%....

And just to add to what Dan R. is saying about tracer: I assume most people use tracer to look at lnL plots over time. The problem with this method, as others have pointed out already on the blog, is that most of us are interested in estimating the posterior distribution of topologies, so why should we use lnL as a proxy for topology when we could just look at the the actual posterior distribution of topologies instead? This is what AWTY can do.

Or, to use an example, if you had "messy" data with a really complex likelihood surface, you would unlikely recognize this by looking at lnL plots alone. You may be prone to stop the analysis before you've adequately explored tree space. I've seen multiple examples with my own data where lnL plots flatten out, but clade posterior probabilities are still changing. In other words, if I stopped the analysis, I would be severely overestimating support for "wrong" clades or underestimating support for "correct" clades.

Personally, I use Tracer / lnL plots as a rough guide to assess how my analyses are behaving, but I don't publish anything without assessing convergence with AWTY.

The problem with diagnosing convergence with lnL scores may not necessarily be limited to complex likelihood surfaces. If you picture a likelihood surface for trees with 10 tips and one for trees with 100 tips, the difference isn't just in the total number of trees; there are also more trees adjacent to any given tree for the data set with more taxa.

For the sake of argument, let's say that each of those likelihood surfaces only has one peak, and that those peaks are equally easy to get to for MCMC so that we can get from our starting point to the best tree in the same number of steps regardless of the number of taxa. Getting to the peak is just the beginning, obviously - to estimate the posteriors of clades for those two data sets, we need to sample some of the trees adjacent to the peak. I don't know what proportion of those nearby trees is the magic number that gets us reliable posteriors, but let's assume there is one, and that it's the same for any number of taxa. What that means is that we'll need to have a much longer chain for the 100 species than we will for the 10 in order to reach that magic proportion, regardless of how long it took us to get to the best tree. And the punch line is that failure to converge for this reason is going to be invisible in your likelihood scores - you're going to be bopping back and forth between trees with similar likelihoods the whole time.

I don't know if I've failed MrBayes, or vice versa, but I am about to throw in the towel with a large (but by no means enormous by today's standards) dataset: 116 taxa by ~10,500 bp divided into seven partitions, both nuclear and mt. AWTY was a revelation to me--particularly its sliding window analysis that Dan R. talked about--and it's clear that this dataset is nowhere near convergence after five days running on the fastest cluster available to me: 3.4 Gigasmertz Xeons with 2gigs ram each. I can't run anything longer than 5 days without special permission and frankly think that five days is about my limit, too. I have tried reducing the default heated chain temp to 0.175 and 0.15 and have used starting trees and different partitioning schemes hoping speed convergence, all to little advantage. On the other hand, partitioned mixed-model ML analysis in RaxML using takes a little over an hour on my MacBook and I can bootstrap it overnight. I've got better things to do than wait weeks for MrBayes.

Hi guys,

Personally I agree with all Dan and others say. I use AWTY and "promote" its use to my colleagues. This is all great when you have small datasets. As one post just indicated (A. Horn), what happens when you have large ones? (ca. > 300 taxa and > 10 000 characters). Obviously running analyses for 1 year is not really possible...

What do people think of this method posted on the cluster website of BEAST at Cornell:

"For example, if instead of specifying just one task you request, say, 6 of them, then after 5 days alloted to the job you will have 6 statistically independent sets of log files, tree files, etc., which you can download and combine (using, for example, LogCombiner and TreeAnnotator on your local machine) for further analysis. This way you can perform 5*6 = 30 days worth of BEAST simulation in just 5 days. This parallel method of running BEAST is preferred over running a single very long job on one processor. First - it gives you the results much faster. Second - compute clusters are usually on a mintenance schedule and typically they need to be rebooted once in a while, so that getting a uninterrupted long periods of computer time is rather difficult."

So for large datasets, I basically have 60 runs (3*20) of ca. 2-5 million gens, and then combine them. By doing this you can get to reasonable ESS values and have runs of ca. 50 - 100 gens long. I also use a ML starting tree (RAxML). In general the lnL of the tree does not change much, mainly because it is already near optimal. Basically I have a short burnin period. The same goes for most parameters. Would this be usable in AWTY (several short runs analyzed as one long one)? I have seen this way of proceeding in PNAS a few months back.

I am curious to know what people think of this on this blog.

The concern I have with conducting multiple runs and combining them is that it seems to me as though it assumes each run has reached convergence. What if you have two local optima, one of which is inferior to the other, but both of which are equally difficult to get away from? In that case, you end up with half (for the sake of argument) of your runs stuck on one optimum and half from the other. When you combine those runs, you weight the topologies from those two optima equally, despite the fact that the true weighting would be less egalitarian.

I just mentioned this idea to Matt Brandley the other day: it seems like you can possibly rescue this sort of multi-run approach if you were to construct consensus trees where you weight each constituent tree by its likelihood. Ideally an MCMC chain that's reached stationarity should already have the property that trees are present in frequencies proportional to their posterior probability, but of course that's exactly what we can't guarantee. Failing that, it seems to me like you might be able to fake it using this sort of weighted consensus approach, and that you might even get better results - 20 runs of 5 million generations might well do a better job of sampling the entire space than would 1 run of 100 generations.

This is just me making stuff up, though. Anybody know whether anyone has tried such an approach?

I agree about the local optima, and what I generally do is scan all runs and check to see if they all have the same lnL for the tree. They also appear to have very similar values for all other parameters. In my datasets this is normally verified. Because my starting tree is already quite good all runs seem to converge towards the same value and very fast, even with a smallish number of generations. When I look at my AWTY analysis of the pooled runs I see a good result. Of course I cannot compare it to anything, but the analysis of individual runs (function compare) appears to be ok.

In the case of several local optima, dare I ask why don't we just delete the runs with lower tree lnLs?

sorry not the compare function, but the cumulative one

In order to get reliable estimates of posterior probabilities we need to explore tree space thoroughly, not just sample a bunch of trees from near the global optimum.

Given the procedure that you've outlined, it's not at all surprising that you get rapid apparent convergence - you're starting at or near the optimum already, and in order to sample any of the rest of tree space you'd basically have to walk downhill quite some ways. If you just start on a peak and stay there, you're going to get high posteriors pretty much no matter what. That's not the same as saying that you have converged on accurate posterior estimates, though.

This is the central difference between how MCMC Bayes is supposed to work and how a heuristic search works - we're not just trying to find the best tree and stick there, we really have to get around in order to make accurate posterior estimates. If an MCMC chain was allowed to run forever, ideally every possible tree would be represented in that chain in proportion to its posterior probability. Obviously we're never going to do that, but that

isthe distribution we're trying to estimate. We can sometimes get away with not exploring tree space really well in cases where the surface is relatively simple, as most of the trees that aren't near the peak are going to be so crappy that they won't factor into the posteriors much anyway. If you have two nearly-equally-good optima, though, you really should not just take the trees from the top of the best one. That will inflate estimates of posteriors far beyond their true level of support.The fact that the model parameters level out isn't strong support for convergence in the scenario you've outlined either - if you're visiting only very similar trees (i.e., stuck on a peak), you're usually going to estimate very similar parameter values as well. So you'll get high posteriors and low variance in parameter estimates and yet still have a failure to converge.

In general, that's one of the things that I think really isn't appreciated enough in MCMC - failure to adequately explore the space results in apparent high confidence. As scientists, we really like to see large posteriors, and we are biased towards interpreting them as indicating that we've got a nice clean result rather than seeing them as a possible indicator that we haven't finished our analysis.

yes, I agree with all that, and I am also fully aware that I am using a short (bad) cut (and I won't even go into the "to partition or not to partition" debate for these large datasets). However, on a an empirical basis the results I get are very much in line with results from eg max pars or ML. Yes, the support values are higher for certain nodes, but well known unsupported nodes are returned as unsupported. It is not as if all of sudden I have a fully resolved tree. But I guess we can discuss this for days. I just wanted to raise this question. Thanks for the insights, it helps a lot.

For those who might be closer to the "Bayes search drivers" is there any hope in the near future of an improvement in MCMC algorithms for large datasets (make the searches quicker and more efficient)? A bit like what we have seen recently in ML tree search (GARLi, RAxML, PHYML etc). Otherwise, Bayes analyses (for large datasets) might just land where it started several centuries ago: nice but too long to do. Maybe a good question to ask John Heulsenbeck during the interview.

I think Dan W. is absolutely correct and, just to reiterate, (depending on the complexity of treespace for your particular data set) if you start all of your MCMC analyses on the same tree space "hill", then you are more-or-less doing a ML search and not actually estimating the true posterior distribution of trees (unless, as Dan pointed out, your treespace is very very simple).

Simply put, if you do this, the posterior probabilities you get out of that analysis may not be "true" posterior probabilities, and thus, not accurate measures of clade support.

As for speeding up MCMC, I also think this would be a great question for John H.'s interview. It's my limited understanding that many of the clever ways that RAxML and GARLI speed up ML searches are not applicable to MCMC, but surely there are some new MCMC developments on the horizon.

On the big data set convergence problem, hat about this notion?: Begin many MCMC runs, each with a different random starting tree, and combine. It seems to me that with enough such runs you would indeed sample points in treespace roughtly in proportion to their likelihoods, even if mixing wasn't all that good, simply because each run would in general make tracks toward the closest/steepest local optimum. You would probably oversample somewhat the region immediately around each starting tree, but intuition tells me this would not seriously affect any node posteriors. Picking a number out of my nether regions, a hundred runs of a million generations might perform much better than 5 runs of 20 million.

You run into exactly the issue I was talking about, uh...seven posts(?) up. Ignoring burnin on each of those chains, if they get stuck on local optima you will be combining trees from those local optima out of proportion to their true posterior probability. Really the only way I can see to do a non-weighted combination of multiple runs is if you think each chain has independently converged on the right answer, and if you think that the question then becomes why you would bother.

I'm not at all sure that the weighting idea I was floating is worth much, but a non-weighted approach that combines multiple short runs could definitely be problematic.

That's why I suggested a large number of starts. It seems to me that random starting points would be attracted to local optima roughly in proportion to their size. Or, thinking of it another way, it's more or less equivalent to an occasional proposal of a radical change in topology. Would that sound better?

To use the good old tree space metaphor: I think it works if you can assume that the width of a peak is proportional to its height, but I'm not at all sure that that's generally true.

Just in case the analogy isn't clear I'll try to put my concern a little more in the right terminology - if you can assume that the posterior probability of trees at and near a local optimum is proportional to the frequency of random starting trees that lie in a region of treespace that, were you to start with that tree, would lead to the optimum in question - then I think you could believe the posteriors obtained from the kind of approach you're talking about. Failing that, I think you're going to end up weighting trees inappropriately.

Thank you all for this discussion!

It is comforting to see I am not alone in trying to tackle poor convergence in Bayesian phylogenetics.

Unfortunately, Bayesian philosophy aside, I think that many of the suggestions on how to deal with the problems of poor convergence are based on the personal experience of the researcher and in turn also very much dependent of the data we are used to deal with.

In my experience with nuclear ribosomal data from ancient invertebrate groups, I have found LnL, PSRF ("Potential scale reduction factor") and "Average standard deviation of split frequencies" to be more or less equally unsatisfactory predictors of topological convergence among runs. The latter measure seems to popular these days, but doesn't this method really only compare posterior probabilities for clades that are found in majority in *both/all* runs and skip the conflicting nodes, which are really at the heart of the matter? Couldn't we somehow compare branch swap steps or similar between trees within a run to trees in other runs to detect such conflicts?

The default temperature of 0.2 in MrBayes also results in horribly poor mixing of about 1-2% or even less for my analyses and I need to lower it something like 0.05-0.1 to up the acceptance rate. I agree that the alarm bell should go off as soon as you notice well supported incongruent clades among clades. These usually appear early on for me and I am really never able to get my runs to integrate well over the islands later on no matter what temperture I use...

The problem is of course that there are so many parameters to tweak and that empirical datasets demonstrating these problems in the systematics field tend to have too many taxa to allow for much exploration within reasonable times. Should I change the substitution model, number of chains, temperature, proposal mechanisms, number of attempted swaps per opportunity, number of generation between swap opportunities, add topological heterogeneity parameters such as the covarion model and so on?

Sometimes on particularly rainy Mondays

I wonder if I am even able to escape the topology and inherent artifacts imposed by the uber simple distance tree constructed at the alignment stage...

... and then I discover some contaminated sequence from an unrelated taxon in my partitioned dataset that might explain some peculiarities of my analyses and remove it...

... only to discover that correcting it does absolutely nothing for solving said peculiarities and that MrBayes was apparently able to shoehorn that taxon into its position without much trouble at all.

Well... on sunny Fridays I usually recover and regain some strength and hope. Here is to running MrBayes4 or Bali-phy on a simple PC with a couple of gaming graphics cards in a few years and out-beat most of todays clusters!

Oops...

It should of course read:

"I agree that the alarm bell should go off as soon as you notice well supported incongruent clades among *runs*".

Not "clades".

Being a Bayes user and often having to defend Bayesian topologies against old-school "Paupists" (even in papers where both Likelihood and Bayesian are included and congruent) I am happy to find new ways not only to defend my results, but also investigate the methods themselves (I am under no illusions that Bayesian or any other method is the single magic bullet).

AWTY? is certainly a useful and interesting tool, and I have discovered, to my surprise, that my topologies do actually appear to converge (this is no mean leap of faith!).

However rather than describing that "compare command plots strictly conform to a line with gradient 1 by eye" (excuse my as yet sloppy wording), has anyone statistically tested the fit of the data to the 'expected' and used the P value to support the observation? I can imagine that saying "the data strongly conformed to topological convergence with P=1.00" would be easier to sell.

Secondarily to this, a single outlier would suggest non-convergent topologies (and grounds for rejection of convergence) but would this single data point be enough to reject statistical significance of such a test?

The takehome message I get here is that somebody really has to do extensive experimentation with search parameters on a number of different, very large data sets (say >100 taxa, >10000 sites as a first approximation), just to see what best produces topological convergence and how long it takes.

Not it.

I'm suspecting that most model parameters other than topology are not of great interest to most users. My general impression is also that optimal model parameters do not change a great deal with changes in topology. It's been a while since I've used MrBayes, but can you easily adjust the balance of topology proposals to other proposals?

It just occurred to me that it might be nice to vary the ratio topology/model proposals over a run. Early in the run, you might want a lot of model variation until you reach a general region of a model peak, and then a lot of topology variation thereafter. This assumes low interaction between model and topology, such that there is not a bimodal distribution with topology A being great with model B, and topology C being great with model D.

I for one would love a small AWTY tutorial. Like Glor said, its difficult to understand the output. The docs are spotty or nonexistent at best. I understand what most of the metrics are but several are lost on me.

A year and half later this thread is still awesome and useful. I hope you guys all have bigger clusters now and can run some larger data sets. :)

XX Dee

I just discovered this phylogeny blog, and there are pretty interesting discussions.

I'd like to ask a question related to Dan Warren's comment of May 20, 2009 2:50 PM:

This is the central difference between how MCMC Bayes is supposed to work and how a heuristic search works - we're not just trying to find the best tree and stick there, we really have to get around in order to make accurate posterior estimates. If an MCMC chain was allowed to run forever, ideally every possible tree would be represented in that chain in proportion to its posterior probability. Obviously we're never going to do that, but that is the distribution we're trying to estimate. We can sometimes get away with not exploring tree space really well in cases where the surface is relatively simple, as most of the trees that aren't near the peak are going to be so crappy that they won't factor into the posteriors much anyway. If you have two nearly-equally-good optima, though, you really should not just take the trees from the top of the best one. That will inflate estimates of posteriors far beyond their true level of support.Let's suppose we're in the case of two nearly-equally-good-optima.

A well behaved Markov chain should sometimes jump between the region of optimum 1 and that of optimum 2.

And this should happen not too rarely: too infrequent jumps would bias the posterior probabilities estimates. Am I right?

Suppose we run 10 chains, and suppose that 5 of them will stick around optimum 1, and the five other will stick around optimum 2.

1) What diagnostics are more likely to detect such a situation when it exists?

2) What can we do to fix the problem?

Is using a lot of coupled chains, over a wide range of temperatures likely to solve the problem ?

Dear all,

This post and the following discussion have been most informative. However, I have a suggestion similar to the combination approach discussed by Thomas Couvreur and I wondered if anyone could spot any problems with it.

In my experience mrBayes runs a lot faster if you use it with nruns=1 - in some cases nearly 50% faster. So you could then run multiple mrBayes tasks on a cluster, each as an independent run using the same MCMC conditions but a different random seed each time.

Then using Tracer and AWTY you could check for convergence between runs prior to building a consensus tree. The difference here is that you do not combine runs, but treat each one as a separate entity. For me this method seems to make a lot of sense because it gets over the default of nruns=2 and instead strengthens the convergence assessment.

The problem is that with nruns=1, mrBayes no longer produces the standard deviation of the split frequencies. But from what I can understand, the AWTY outputs provide a similar method of assessing the runs and indeed allow the user to assess many more runs at once. I would however really appreciate it if anyone can see a glaring error in my logic and let me know!

I may be missing the obvious here - I am not a phylogeneticist and I am still relatively new to mrBayes. But as far as I can tell, this method allows you to run larger analyses much more efficiently and improves the rigour of the convergence estimate.

HI,

people like myself are still (2014) eagerly reading this informative thread.

Will some one please answer Mark (last question posted 2011).

Thanks for all the comments

Post a Comment