A recent Dechronization post highlighted the unsuccessful attempts at Bayesian estimation of a large-scale bird phylogeny based on a multi-locus data set by Hackett et al. The apparent failure of MrBayes in this particular case (and under similarly challenging inference scenarios associated with large and/or complex data sets, e.g., Soltis et al., 2007; Moore et al., 2008) appears to raise serious concerns regarding our ability to estimate large-scale phylogeny using Bayesian methods.
However, it is important to carefully consider precisely what such studies have actually demonstrated: that Bayesian estimation of phylogeny appears to be intractable for certain data sets using the default settings implemented in a particular program, MrBayes. Unfortunately, these anecdotal observations have led some researchers to a nested series of increasingly dubious and unsubstantiated conclusions: first, that it is impossible to reliably estimate phylogeny for this particular data set under any settings implemented in MrBayes; second, that it is impossible to reliably estimate phylogeny for this particular data set not only using MrBayes but using any Bayesian method; and finally, following this false premise to its ultimate conclusion, that it is impossible to reliably estimate phylogeny not only for this particular data set, but for any large-scale data set using Bayesian methods.
Although Bayesian estimation of phylogeny appears to succeed for the vast majority of empirical problems, there remain inference problems for which Bayesian estimation is apt to be intransigent, which may be usefully divided into three categories: (1) inference scenarios in which reliable Bayesian (or any other) estimation is likely to be problematic (e.g., whole-genome alignments for extremely large numbers of species); (2) inference scenarios in which rigorous application of existing Bayesian methods is apt to fail; and (3) inference scenarios in which imprudent application of existing Bayesian methods using default settings is apt to fail. I believe that the vast majority of reportedly “impossible” Bayesian phylogeny estimation problems fall within the latter two categories.
Existing implementations, such as MrBayes, approximate the joint posterior probability density of phylogeny and model parameters using some form of MCMC sampling (typically based on the Metropolis-Hastings algorithm). These methods quietly specify a means of updating the value of each parameter (the proposal mechanisms), the probability of invoking each proposal mechanism (the proposal probability), and the magnitude of the proposed change issued by each proposal mechanism (the tuning parameters). Proposal mechanism design is an art form (there are no hard rules that ensure valid and efficient MCMC sampling for all problems). For this reason, for many (non-phylogenetic) Bayesian inference methods, it is the responsibility of the investigator to explore a range of proposal probabilities and tuning parameterizations in order to find settings that deliver acceptable MCMC performance.
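As a concrete (and purely hypothetical) illustration of these moving parts, here is a minimal Python sketch of a single-parameter Metropolis sampler; none of this is MrBayes' actual code, and the names are mine. The 'tuning' argument plays the role of the tuning parameters described above, and the fraction of accepted proposals is the quantity one monitors when adjusting it.

import numpy as np

rng = np.random.default_rng(1)

def metropolis(log_posterior, x0, n_steps=10_000, tuning=0.5):
    """Toy single-parameter Metropolis sampler.
    'tuning' is the standard deviation of the symmetric normal proposal:
    too small and the chain crawls; too large and almost nothing is accepted."""
    x, lp = x0, log_posterior(x0)
    samples, n_accepted = [], 0
    for _ in range(n_steps):
        x_prop = x + rng.normal(scale=tuning)        # proposal mechanism
        lp_prop = log_posterior(x_prop)
        if np.log(rng.uniform()) < lp_prop - lp:     # Metropolis acceptance rule
            x, lp = x_prop, lp_prop
            n_accepted += 1
        samples.append(x)
    return np.array(samples), n_accepted / n_steps   # trace and acceptance rate

# e.g., sample a standard normal target and check the acceptance rate:
trace, accept_rate = metropolis(lambda x: -0.5 * x * x, x0=0.0)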
Accordingly, most researchers familiar with Bayesian inference would consider it extremely naïve to expect that any specific MCMC sampling design would perform well for all (or even most) empirical data sets, especially in the very difficult case of phylogeny estimation. Nevertheless, the default settings of existing Bayesian phylogeny estimation programs are so successful that we are actually “shocked, shocked to find that MrBayes does not solve all of our problems!!”. Without going into detail (as doing so would constitute an entirely separate post), the analyses detailed in the supporting material of the Hackett et al. study read like a recipe for failure, and I would venture that the putative 'impossibility' of obtaining a reliable estimate with MrBayes in this case falls squarely under the third inference scenario defined above.
What does this mean for our phylogenetic community? First, I would argue that researchers interested in Bayesian estimation of phylogeny need to become much, much more sophisticated about diagnosing MCMC performance, carefully assessing convergence (ensuring that the chain has reached the stationary distribution, which is the joint posterior probability density of interest), mixing (assessing movement of the chain over the stationary distribution in proportion to the posterior probability of the parameter space), and sampling intensity (assessing adequacy of the number of independent samples used to approximate the posterior probability). Second, I believe that developers of Bayesian methods need to encourage and facilitate more vigorous and nuanced exploration of MCMC performance among users of these methods.
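As one concrete example of the 'sampling intensity' diagnostic, here is a rough, hypothetical Python sketch of an effective sample size (ESS) calculation for a single parameter trace; it is the same kind of quantity that Tracer reports, although the estimator details differ, and the function name is mine.

import numpy as np

def effective_sample_size(trace):
    """Crude ESS: chain length discounted by the summed positive autocorrelation."""
    x = np.asarray(trace, dtype=float)
    n = len(x)
    x = x - x.mean()
    # autocorrelation at lags 0..n-1
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    rho_sum = 0.0
    for rho in acf[1:]:
        if rho < 0:          # truncate at the first negative autocorrelation
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

A long chain with strongly autocorrelated samples can still yield an ESS of only a few dozen, which is why eyeballing the trace plot alone is not enough.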
26 comments:
Sorry for the long post...it was a very long flight back home.
A few comments to folks who made comments on Rich's original post...
@ Todd Jackman:
"...if really big data sets drastically differ in trees produced depending on assumed models and the only way to fix it is to arbitrarily fiddle with priors, then likelihood trees seem like a better alternative."
I suppose that you may have been joking, but of course, no one would advise 'arbitrarily fiddling with models and priors' as a means of properly diagnosing and troubleshooting MCMC performance. However, comparing the marginal log likelihoods estimated under alternative models using Bayes factors provides a valid means of objectively choosing among those models. Moreover, fast ML methods are prone to convergence problems as well, even though this may not be widely acknowledged.
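(For concreteness: the Bayes factor comparing models M1 and M2 is the ratio of their marginal likelihoods, so on the log scale 2 ln BF = 2[ln m(X | M1) - ln m(X | M2)]; on the conventional Kass and Raftery scale, values above ~10 are considered very strong evidence for M1.)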
@ Dan Rabosky:
"I've analyzed datasets that very quickly reached apparent stationarity, and for which independent runs reached the same stable plateau. But in fact, within individual runs, shifts between islands in tree space occurred approximately every 30-50 million generations."
I've also encountered the symptoms that you describe: independent convergence to a transiently stable plateau in the marginal log likelihood plot. From my experience, this most typically stems from model misspecification, namely, specification of an under-parameterized (mixed) model that does not adequately capture the complexity in the data. There is anecdotal evidence to suggest that this was a contributing factor to the problems encountered by Hackett et al.
@ blackrim:
"...the larger question users should be asking is whether they are looking for the ML tree with attempts at the confidence interval (using some technique) or are users looking for samples from the posterior distribution of trees."
I agree: this is an important consideration. Personally, I'm generally more interested in making phylogeny-based inferences (e.g., on rates of diversification, character evolution, etc.), rather than in the phylogeny per se (that is, in revising classification, etc.). Accordingly, I'm actually more interested in the posterior probability density of trees that provides a natural and valid means of accommodating phylogenetic uncertainty in these evolutionary inferences than I am in the point estimate of the tree.
Thanks Brian. I particularly liked your point about RAxML likelihood scores versus MrBayes ML scores. You suggest that the Hackett et al. strategy reads like a recipe for failure and that further details are beyond the scope of your post, but I was hoping you could highlight a few specific strategies that you've found particularly useful for Bayesian searches of large datasets.
Hey Rich:
Unfortunately, even a very cursory discussion of strategies for MCMC approximation of large trees is way beyond the scope of this post (and certainly of a comment on it!), but there are a few obvious issues. As I mentioned in my reply to Dan's comment, convergence (and other aspects of MCMC performance) is influenced by model adequacy.
Hackett et al. provide a fairly limited description of their analyses (which is a bit frustrating given that there are no limits on the length of online supporting information). In any case, they apparently attempt two (almost certainly inadequate) modeling schemes: (1) a uniform GTR+I+G model with parameters shared across the 19 loci (which oddly caused the program to crash), and (2) a partitioned model in which the revmat parameters (the six substitution rates) were unlinked across the 19 loci.
Accordingly, Hackett et al. excluded parameters that are believed to accommodate the most important modeling components (see, e.g., Nylander et al. 2004; Brandley et al., 2005), such as: (1) allowing independent alpha parameters among the 19 loci (which describe the shape of the gamma-distributed among-site substitution rate variation); (2) allowing overall rates to vary among the 19 partitions (by setting ratepr=variable in the 'prset' command); (3) including partitions for codon positions within the coding regions, etc.
[As an aside, Hackett et al. chose to accommodate among-site substitution rate variation (ASRV) using a combination of pInv (the proportion-of-invariant-sites model) and G (the gamma-distributed rates model, which models the rate of a site as a gamma-distributed random variable, approximated with four discrete rate categories). When you attempt to accommodate ASRV both by binning sites as {variant | invariant} and by simultaneously treating the rate of each site as a gamma-distributed random variable, you may induce two regions of high posterior probability, which causes problems with convergence of the MCMC.
Specifically, you could have a fairly large pInv and moderate gamma-distributed rate variation (a larger alpha shape parameter value), or you could have a fairly small pInv and rather high gamma-distributed rate variation (a smaller alpha shape parameter value). This induces a 'saddle' in the joint posterior density, so you are better off describing ASRV with the gamma alone (because it is more flexible), and using more discrete rate categories to approximate this continuous gamma-distributed ASRV.]
Anyway...
Secondly, efficient performance of the Metropolis-coupling mechanism becomes increasingly important for large-scale (or otherwise complex) inference problems, as this greatly facilitates mixing among regions of high posterior probability.
Accordingly, it is critical to monitor and adjust the behavior of the Metropolis-coupling mechanism; i.e., to ensure that acceptance rates of attempted swaps between the cold chain and each of the (by default) three incrementally heated chains fall within the target window (~20-60%). [The acceptance rates are reported by calling the 'sump' command.] If the acceptance rates are not within the desired window, then you can adjust the 'temp' parameter, T, which controls the degree of incremental heating applied to the Metropolis-coupled chains.
When N MC3 chains are run, N–1 of them are incrementally heated. The heat H (the power to which the posterior probability is raised) applied to the ith chain is a function of T and the chain index i. Specifically,
H = 1 / (1 + T(i - 1)).
When H = 0, all trees have equal probability, whereas H = 1 for the "cold" chain. The higher the temperature, T, the more likely the heated chains are to move between isolated peaks in the posterior distribution. However, excessive heating may lead to very low acceptance rates for swaps between different chains.
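For concreteness, here is a small, hypothetical Python sketch (not MrBayes' code; the names are mine) of the incremental heats and of the log acceptance ratio for a proposed state swap between two chains, which together show why lowering 'temp' typically raises swap acceptance rates.

import numpy as np

def chain_heats(n_chains=4, temp=0.1):
    """Incremental heats H_i = 1 / (1 + T*(i - 1)); chain 1 is the cold chain (H = 1)."""
    i = np.arange(1, n_chains + 1)
    return 1.0 / (1.0 + temp * (i - 1))

def log_swap_acceptance(log_post_j, log_post_k, heat_j, heat_k):
    """Log of the MC3 swap acceptance ratio for chains j and k."""
    return min(0.0, (heat_k - heat_j) * (log_post_j - log_post_k))

# e.g., lowering temp from 0.2 to 0.05 makes the heated chains less hot,
# so their states differ less from the cold chain and swaps are accepted more often:
print(chain_heats(temp=0.2), chain_heats(temp=0.05))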
Moreover, you can adjust the number of incrementally heated chains (the only option considered by Hackett et al.), the frequency with which swaps are attempted (by default, every 1000 cycles of the MCMC), the number of swaps attempted at each swap interval, etc.
Finally, it is important to monitor the mixing of all other parameters (contrary to the previous comment by Dan), which can be examined using the 'sump' command in MrBayes or by other means; the proposal probabilities and/or tuning parameters for poorly mixing parameters can then be adjusted to bring their acceptance rates into the target window (~20–60%).
Hi Brian, thanks for the thoughtful comments. I have a question about your aside regarding gamma and pinv - why does this make a saddle rather than a ridge?
Somebody should get the word out about the pinv + gamma problem. Pretty much everybody does this and I didn't know about the problem until you told me at Bodega.
In my experience the temperature of the incrementally heated chains always needs to be reduced from the default value for appropriate swapping in large datasets. My experience is limited, but I gather this is the experience others are having as well.
Hi Luke:
What I really have in mind is an inverted trough (at least for the most typical parameter-interaction scenario), although I have encountered data sets that are indeed more appropriately described as 'saddles'. The distinction, of course, depends on the continuity/bimodality of the pInv/alpha parameter interaction.
As a (recursive) aside, I should acknowledge that Greg Pauly deserves credit for pointing out that the 'ridge' metaphor (which I have previously used to describe this phenomenon, and how it was originally described to me by Ziheng Yang) might be a potential source of confusion for some people, as it may unintentionally connote a linear series of peaks (like a 'ridgeback' or 'sawtooth' mountain range, I suppose).
It's all somewhat academic, of course, given that such structures do not have a straightforward representation in the N-dimensional joint posterior probability density (where N is the number of free parameters), but from a strictly pedantic perspective, perhaps the less euphonious 'inverted trough' may be the most generally appropriate metaphor.
OK that makes more sense. My thinking was that using gamma is arbitrary, breaking it into four categories is arbitrary, so why not add one more category where rates are zero. The distribution is no longer gamma, but I wonder why, more specifically, that matters? Is it just that we have to add an additional nonidentifiable parameter? Or do we really believe that the rate distribution across sites is drawn from a gamma distribution?
One more question for you all, if you're still paying attention to this long thread: if tree space really does contain two islands of high likelihood, separated by a large valley, it seems to me quite possible that even a well-behaved MCMC run might do some of the things you describe. So, if one peak represents ~40% of the posterior probability and the other ~60%, then the chain might spend a few million generations on the first peak, then shift, and spend a few million on the second, then shift back. Or, move the peaks apart, and substitute "billion" for "million." In this case, if two runs end at different likelihoods, one should still want to combine them at the end to get the proper posterior probabilities.
Although many of the points discussed about parameter selection surely influence the performance of the MCMC, I think that the principal problem with these analyses of big/complex data sets is the Metropolis-Hastings algorithm itself.
I love the MH algorithm (I'm a huge fan of simulated annealing); it can be highly effective when the current solution is near the optimal score (the likelihood, in this case). But for worse trees, the number of topologies tried is very small (just a few branches changed at each iteration). It is therefore not rare (as commented by Dan Rabosky) that runs require 20-30 million generations to find another island.
For the "archaic" parsimony, the limits of simple branch perturbation (TBR) is about 100-500 taxa (depending on the complexity of the data) in real data sets, so I expect that the actual implementations of Bayesian analysis (i.e. MrBayes) is in the same limit (unless you have the time to run billions of generations!).
Hi Luke:
In answer to your first point: accommodating ASRV using a Gamma distribution is an inherently Bayesian approach (even when used under ML estimation). That is, we specify a parametric distribution (a prior probability density) to describe ASRV, which we effectively integrate over (or more precisely, we sum over the discrete approximation of this continuous prior density). It is valid to increase the number of discrete categories used to approximate the underlying continuous gamma distribution (as increasing the number of discrete categories better accommodates sites with very low/invariant substitution rates), but even so, we are still choosing to describe the behavior of the data as a Gamma distribution, rather than some mixed model.
The problem with using a mixed (pInv+G) model to describe ASRV is parameter interaction, the fact that we are attempting to capture variation in the data with two mutually exclusive parameters. In other words, the pInv and alpha parameters are non-identifiable.
Given that the gamma can accommodate sites with very low rates, there is no reason to use a mixed model to accommodate ASRV.
[Aside: Keep in mind that the likelihood of each site is integrated over the gamma distribution (by calculating the likelihood under each rate category and averaging across categories), such that the computational burden (and run times) will scale linearly with the number of rate categories used to approximate the continuous gamma. I.e., all else being equal, an analysis with 6 rate categories will have a run time ~1.5 times that of a run with 4 rate categories.]
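To show both the discrete-gamma approximation and the pInv + G mixture in one place, here is a minimal, hypothetical Python sketch; the function names are mine, and 'site_lik_at_rate' stands in for whatever per-rate site likelihood the substitution model computes. The loop over categories is what makes the cost scale linearly with their number.

import numpy as np
from scipy.stats import gamma

def discrete_gamma_rates(alpha, k=4):
    """Median rates of k equal-probability Gamma(alpha, 1/alpha) categories,
    rescaled so the mean relative rate is 1 (one common discretization)."""
    mid_quantiles = (2 * np.arange(k) + 1) / (2.0 * k)
    rates = gamma.ppf(mid_quantiles, a=alpha, scale=1.0 / alpha)
    return rates / rates.mean()

def site_likelihood(site_lik_at_rate, alpha, k=4, p_inv=0.0, lik_if_invariant=0.0):
    """Average the site likelihood over the k gamma categories; mixing in an
    invariant-sites class (p_inv > 0) gives the pInv + G model."""
    gamma_part = np.mean([site_lik_at_rate(r) for r in discrete_gamma_rates(alpha, k)])
    return p_inv * lik_if_invariant + (1.0 - p_inv) * gamma_part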
@ Luke J. Harmon:
"...it seems to me quite possible that even a well-behaved MCMC run might...spend a few million generations on the first peak, then shift, and spend a few million on the second, then shift back."
Indeed, this would be the desired behavior of a well-behaved MCMC sampler, and this property of the chain is called mixing. Remember that the goal is not optimization (as in ML, where the objective is to find the point estimates of all free parameters that maximize the likelihood function), but rather to integrate over the entire parameter space, sampling each point in the state space (a set of parameter values in the joint posterior probability density) in proportion to its posterior probability. This is why we are interested in the marginal probabilities of parameters rather than their MLEs (point estimates).
If the marginal posterior probability of a parameter is bimodal, then it may be more appropriate to present the entire density (or a summary that better captures the entire density, such as the HPD) rather than using a more conventional summary statistic (e.g., it would be inappropriate to summarize the value of alpha in the case that we have been discussing as the mean/median, etc., between two regions of high posterior probability).
You mention that "...if two runs end at different likelihoods, one should still want to combine them at the end to get the proper posterior probabilities.". I'm not sure I follow; if both chains have mixed over the entire parameter space and sampled from it in proportion to the posterior probability, then the marginal likelihoods of the two chains should be (approximately) equal.
It is only when the two chains become entrapped in alternate regions (e.g., chain 1 restricted to region A and chain 2 restricted to region B), and/or when the two chains fail to mix over alternate regions of the joint posterior probability density (e.g., both chains 1 and 2 restricted to region A or to region B), that it would be inappropriate to combine samples, because in these cases the chains are ill-behaved. In the first case, the marginal probabilities of (at least some) parameters will differ, which will alert us to the fact that the MCMC has experienced problems with convergence and mixing.
The second scenario is somewhat more dangerous, as it will appear that both chains (presumably initiated from random points in the joint posterior probability) have independently converged to and sampled from a single region of the joint posterior probability density (a false stationary distribution), which is why it is crucial to perform multiple independent runs and pay attention to the behavior of the Metropolis-coupling mechanisms (which reduce the incidence of this pathological behavior).
@ Salva:
"I love MH algorithm...[but] for worse trees, the number of topologies tried is very small...[such that] the runs require 20-30 millions of generations to find another island."
To clarify, the Metropolis-Hastings algorithm governs the behavior of the MCMC; specifically, by calculating the acceptance probabilities of proposed changes to the state of the chain. However, the MH algorithm is not a proposal mechanism per se, and it is not specific to any particular proposal mechanism(s).
Specific to your comment, the MH algorithm is not an algorithm that proposes perturbations to the topology (or to the values of other parameters); rather, it calculates the acceptance probabilities of changes proposed by the action of the selected proposal mechanisms.
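(In symbols: for a move from the current state θ to a proposed state θ' drawn from a proposal density q(θ' | θ), the MH acceptance probability is min{1, [p(θ' | X) q(θ | θ')] / [p(θ | X) q(θ' | θ)]}; the proposal mechanism supplies q, and the MH rule supplies only the accept/reject step.)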
The topology proposal mechanisms implemented in MrBayes are similar to those used for optimization methods (ML and parsimony) implemented in other programs (PAUP, BEAST), including NNI, SPR and (extended) TBR.
Remember, also, that the goal is not to "find tree islands" (as in optimization methods), but rather to have the chain mix over (and proportionally sample from) the entire state space.
Hi Brian,
What if the substitution process is truly bi-modal? For example - in a highly conserved coding sequence, first and second (codon) positions could be entirely invariant, while third position substitution rates could follow a Γ distribution with relatively high α, no? No Γ distribution, regardless of α, has this form (whereas I + Γ does, in fact).
Obviously, this may just obviate the identifiability problem that you mention (but if we drop the invariant sites parameter from our model, we'd never know).
Hey, thanks for the great discussion of these arcane but important issues. Great to read this right as I am dealing with MrBayes convergence problems of my own in two large multilocus datasets. I've gone back and looked at the chain swap acceptance rate statistics on one of these, and indeed several do fall outside the 20-60% range. So perhaps I need to experiment with altering the temp parameter, but I am a newbie at this. I wish there were some detailed troubleshooting advice for MrBayes available online. What's available in the MrBayes wiki is pretty spartan. In my experience as a pretty unschooled MrBayes user, increasing the number of data partitions beyond 4 or 5 and the number of unlinked model parameters, even if they are the appropriate ones determined by AIC scores, virtually guarantees convergence problems.
Hi Aloysius:
I agree that guidelines for troubleshooting MCMC performance for Bayesian estimation of phylogeny are difficult to come by. There are some good books on the more general problem (e.g., Markov Chain Monte Carlo in Practice), but these are fairly dense treatments for which application to the phylogeny problem may not be obvious.
You mentioned that: "In my experience...increasing the number of data partitions beyond 4 or 5 and the number of unlinked model parameters...virtually guarantees convergence problems."
This is an interesting observation. On one hand, more complex models will have more dimensions in their state space, with a corresponding increase in the complexity and shape of the posterior distribution. We might therefore expect that the increased complexity of the joint posterior probability density might slow convergence and mixing of the MCMC.
Somewhat counterintuitively, however, more realistic models may theoretically lead to posterior distributions that are actually easier to traverse using MCMC, despite the increase in the number of parameters. In fact, this expectation has been demonstrated (see, e.g., Nylander et al. 2004).
In my experience, more complex (highly partitioned) models tend to converge much more quickly and reliably relative to 'simpler' models. The exception, however, is when the more complex model includes more than a couple of 'weak' parameters: that is, superfluous parameters that are not capturing patterns in the data (i.e., parameters for which there are few or no observations available to estimate them).
The most common scenario involves one or more 'revmat' parameters (which specify the six nucleotide substitution rates in the GTR model) for short data partitions. (Note that this can occur even if GTR is selected for the partition using AIC, etc., as implemented in ModelTest.) This can be diagnosed by examining the marginal posterior probability densities for all parameters (e.g., using Tracer) to identify parameters that are not mixing well and have anomalously low ESS values.
Hi Liam:
I think I'll write a separate post dealing with approaches for accommodating ASRV, since there seems to be quite a bit of interest in this problem, so if you don't mind, I think that I'll defer addressing your question for that post.
This discussion has been great, but has anybody else noticed that Brian is mounting a wooden reindeer in his blogspot photograph?
Actually, Rich, the photo is of me reindeer tipping (huge Canadian pastime).
There is a close relationship between the MCMC algorithm and optimization. The probability of accepting a proposed state depends on the difference between the scores of the current and the proposed states. So although the strict optimum might not be sampled, the new states move around the optimal island, and the convergence of results is a consequence of this. Thus, even if the objective is not to find tree islands, the difficulty of reaching them (in reasonable time) makes the method "fail" for large/complex data.
Maybe a more radical mechanism for proposing new states needs to be developed for such cases, but as in the case of over-heating, this might produce very low acceptance rates (so although MH does not specify a particular proposal mechanism, in practice it requires that the proposed state be very similar to the current state).
This is a fantastic discussion. As someone who is new to the field and plans to build a large phylogeny in the near future, I'm curious: are there studies out there that use simulated data to compare methods, to see how well they recover the known parameters used to create the dataset? If not, why not? Is it considered an overly simplistic problem, or is it difficult to generate such data? It seems to me that much of the debate could be explored via this approach.
Yes, there are many such studies. A classic example is:
Huelsenbeck, J. P., and D. M. Hillis. 1993. Success of phylogenetic methods in the four-taxon case. Systematic Biology 42: 247-264.
@ Salva:
"There is a close relationship between the MCMC algorithm and optimization."
I suspect that your self-professed interest in simulated annealing (SA) algorithms may be coloring your perspective on Metropolis-Hastings (MH) based samplers. Although it is true that SA is a special case of the MH algorithm, it is very unusual in that it is a greedy algorithm (i.e., it is used for optimization rather than integration).
[As an aside, it is axiomatic that there exists a relationship between 'optimal tree islands' (by which I presume you mean regions of parameter space with high log likelihood) and regions of parameter space with high posterior probability...the latter is proportional to the product of the former and the joint prior probability density of the parameters.]
In any case, you seem to presuppose that there is a problem with existing MCMC-based methods for Bayesian estimation of phylogeny (that is causing them to fail to 'reach optimal tree islands in reasonable time'). Accordingly, it seems that you have missed (or perhaps reject) the primary premise of my post: that the vast majority of the reported 'failures' of existing Bayesian inference methods (such as MrBayes) are caused by user error rather than by problems with the methods per se.
In other words, most putatively intractable phylogenetic problems reflect a failure to use existing methods to their full potential, rather than being caused by inherent limitations of these methods. This is not to trivialize the computational effort/special skills required to perform rigorous analyses of large and/or complex data sets with Bayesian methods. On the contrary, I believe that these complaints against existing methods are based upon the very naïve and curious expectation that it should be 'easy' to obtain reliable estimates for very hard phylogenetic problems.
The days when data analysis was the 'easy' part of phylogenetic research are behind us, and this—in my opinion—is a sign that our discipline is advancing as a science.
@ Anonymous:
I read your question as: "Have previous simulation studies compared the performance of alternative (e.g., ML and Bayesian) phylogenetic inference methods for large-scale problems?"
The answer, basically, is 'No'. Simulation studies are meaningful to the extent that they rigorously explore the relevant parameter space (i.e., the conditions that are likely to be encountered in the application of the methods under consideration to real data sets).
For large-scale phylogenetic problems, the relevant parameter space is of very high dimension, and this complexity makes it difficult to simulate data that are likely to adequately approximate reality. Moreover, the need to thoroughly explore the simulated parameter space (i.e., by analyzing many simulated data sets that represent points across the simulated parameter space) is made very difficult by the computational burden of large-scale phylogenetic analyses.
For example, simulation studies that I have performed (which involve much simpler inference problems...i.e., where the relevant parameter space is of much lower dimension) have required the analysis of hundreds of thousands or many millions of simulated data sets, which would simply not be feasible for the large-scale phylogeny problem.
The problem with large simulated data sets makes me wonder why the experimental viral phylogeny approach from the early 90's was abandoned. In fact, the sequences of the entire virus for the original 8-taxon experimental phylogeny were never published (and are not available in GenBank).
If I'm mistaken and there is more than just 1 kb of sequence and 63 variable sites out there for the original studies, let me know.
Hillis, D. M., J. J. Bull, M. E. White, M. R. Badgett, and I. J. Molineux. 1992. Experimental phylogenetics: generation of a known phylogeny. Science 255 (5044): 589-592.
and
Hillis, D. M., J. P. Huelsenbeck, and C. W. Cunningham. 1994. Application and accuracy of molecular phylogenies. Science 264 (5159): 671-677.
(This comment is about the very first comment I made, way back at the top of this page)
The reason I included a quote from Yang's paper is that he does appear to be advocating arbitrarily (maybe just a little bit arbitrarily) adjusting priors with the goal of reducing posterior probabilities for conflicting nodes.
"In both the small and large datasets, the data size-dependent prior almost always led to reduced PPs for the MAP trees. It thus appears useful in reducing the apparent conflicts in Bayesian phylogenetic analysis. The implementation here involves some arbitrariness."
I disagree with Yang in his view of hard polytomies, and I don't think it's a great paper, but the big conflict in his paper exists because the conflicts arise from amino-acid versus nucleotide models, whose likelihood scores cannot be directly compared.
This is an addendum to my previous comment. It was pointed out to me that the ML scores generated by RAxML and MrBayes are not strictly comparable. Thus the fact that the score from RAxML is lower than that from the failed Bayesian analyses may not be meaningful.