A recent Dechronization post highlighted the unsuccessful attempts at Bayesian estimation of a large-scale bird phylogeny from the multi-locus data set of Hackett et al. The apparent failure of MrBayes in this particular case (and under similarly challenging inference scenarios associated with large and/or complex data sets, e.g., Soltis et al., 2007; Moore et al., 2008) raises serious concerns regarding our ability to estimate large-scale phylogeny using Bayesian methods.
However, it is important to consider carefully what such studies have actually demonstrated: that Bayesian estimation of phylogeny appears to be intractable for certain data sets under the default settings implemented in a particular program, MrBayes. Unfortunately, these anecdotal observations have led some researchers to a nested series of increasingly dubious and unsubstantiated conclusions: first, that it is impossible to reliably estimate phylogeny for this particular data set under any settings implemented in MrBayes; second, that it is impossible to reliably estimate phylogeny for this data set not only using MrBayes but using any Bayesian method; and finally, following this false premise to its ultimate conclusion, that it is impossible to reliably estimate phylogeny not only for this particular data set, but for any large-scale data set using Bayesian methods.
Although Bayesian estimation of phylogeny appears to succeed for the vast majority of empirical problems, there remain inference problems for which Bayesian estimation is apt to prove intractable, and these may be usefully divided into three categories: (1) inference scenarios in which reliable Bayesian (or any other) estimation is likely to be problematic (e.g., whole-genome alignments for extremely large numbers of species); (2) inference scenarios in which rigorous application of existing Bayesian methods is apt to fail; and (3) inference scenarios in which imprudent application of existing Bayesian methods under default settings is apt to fail. I believe that the vast majority of reportedly “impossible” Bayesian phylogeny estimation problems fall within the latter two categories.
Existing implementations, such as MrBayes, approximate the joint posterior probability density of phylogeny and model parameters using some form of MCMC sampling (typically based on the Metropolis-Hastings algorithm). These programs quietly specify a means of updating the value of each parameter (the proposal mechanisms), the probability of invoking each proposal mechanism (the proposal probabilities), and the magnitude of the change issued by each proposal mechanism (the tuning parameters). Proposal-mechanism design is an art form: there are no hard rules that ensure valid and efficient MCMC sampling for all problems. For this reason, for many (non-phylogenetic) Bayesian inference methods, it is the responsibility of the investigator to explore a range of proposal probabilities and tuning parameterizations in order to find settings that deliver acceptable MCMC performance.
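To make the role of these tuning parameters concrete, here is a minimal, purely illustrative Metropolis sampler for a single continuous parameter. It is a toy sketch, not a description of how MrBayes is implemented: the target density, the function names, and the sliding-window proposal are all my own assumptions for illustration. The point is simply that one tuning parameter (the proposal window width) controls the trade-off between acceptance rate and step size.

```python
import math
import random

def metropolis(log_post, x0, n_steps, step_size, seed=1):
    """Minimal Metropolis sampler for one continuous parameter.

    step_size is the lone tuning parameter: the proposal mechanism is a
    symmetric sliding window, x' = x + Uniform(-step_size, step_size).
    """
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    samples, accepted = [], 0
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step_size, step_size)   # propose a move
        lp_new = log_post(x_new)
        # Metropolis rule: accept with probability min(1, posterior ratio)
        if lp_new >= lp or rng.random() < math.exp(lp_new - lp):
            x, lp = x_new, lp_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps

# Toy target: a standard normal log-density (up to an additive constant).
log_post = lambda x: -0.5 * x * x

# Too timid a step accepts nearly everything but crawls across the posterior;
# too bold a step is rejected almost always; tuning seeks the middle ground.
for d in (0.1, 2.0, 50.0):
    _, rate = metropolis(log_post, 0.0, 20000, d)
    print(f"step_size = {d:5.1f}   acceptance rate = {rate:.2f}")
```

Running the loop shows the acceptance rate falling as the window widens; neither extreme yields efficient sampling, which is precisely why a fixed default cannot be expected to suit every data set.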
Accordingly, most researchers familiar with Bayesian inference would consider it extremely naïve to expect any specific MCMC sampling design to perform well for all (or even most) empirical data sets, especially in the very difficult case of phylogeny estimation. Nevertheless, the default settings of existing Bayesian phylogeny estimation programs are so successful that we are actually “shocked, shocked to find that MrBayes does not solve all of our problems!!” Without going into detail (as doing so would constitute an entirely separate post), the analyses detailed in the supporting material of the Hackett et al. study read like a recipe for failure, and I would venture that the putative 'impossibility' of obtaining a reliable estimate with MrBayes in this case falls squarely under the third inference scenario defined above.
What does this mean for our phylogenetic community? First, I would argue that researchers interested in Bayesian estimation of phylogeny need to become much, much more sophisticated about diagnosing MCMC performance, carefully assessing convergence (ensuring that the chain has reached the stationary distribution, which is the joint posterior probability density of interest), mixing (assessing movement of the chain over the stationary distribution in proportion to the posterior probability of the parameter space), and sampling intensity (assessing adequacy of the number of independent samples used to approximate the posterior probability). Second, I believe that developers of Bayesian methods need to encourage and facilitate more vigorous and nuanced exploration of MCMC performance among users of these methods.
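As one concrete illustration of "sampling intensity," the sketch below computes a crude effective sample size (ESS): the number of nominally independent draws that an autocorrelated MCMC chain is actually worth. This is a simplified estimator of my own devising for exposition (it truncates the autocorrelation sum at the first non-positive lag), not the algorithm used by MrBayes or by any particular diagnostic tool; production diagnostics are considerably more careful.

```python
import random

def effective_sample_size(chain):
    """Crude ESS estimate: n / (1 + 2 * sum of positive-lag autocorrelations),
    truncating the sum at the first non-positive autocorrelation."""
    n = len(chain)
    mean = sum(chain) / n
    dev = [x - mean for x in chain]
    var = sum(d * d for d in dev) / n
    if var == 0.0:
        return float(n)
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, n // 2):
        rho = sum(dev[i] * dev[i + lag] for i in range(n - lag)) / ((n - lag) * var)
        if rho <= 0:   # autocorrelation has died out; stop accumulating
            break
        tau += 2.0 * rho
    return n / tau

rng = random.Random(0)

# An independent (well-mixed) chain: nearly every draw counts.
iid = [rng.gauss(0, 1) for _ in range(5000)]

# A sticky AR(1) chain mimicking poor mixing: few draws are independent.
ar = [0.0]
for _ in range(4999):
    ar.append(0.95 * ar[-1] + rng.gauss(0, 1))

print(f"ESS of independent chain:     {effective_sample_size(iid):8.1f}")
print(f"ESS of poorly mixing chain:   {effective_sample_size(ar):8.1f}")
```

Both chains contain 5,000 samples, yet the sticky chain carries only a small fraction of that in independent information; this is exactly the kind of gap that a researcher must check before trusting a posterior summary.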
Researchers unwilling to develop the requisite knowledge to properly diagnose and troubleshoot MCMC performance should seriously consider alternative strategies, including collaborating with researchers who possess these skills or, of course, pursuing alternative inference methods, such as 'fast' ML approaches. However, it seems that most researchers are equally unclear about the potential deficiencies of the latter methods. Along these lines, and in the spirit of the anecdotal account that inspired this post, I note that I have encountered many data sets for which multiple independent searches using fast ML methods (implemented in GARLI and RAxML) rendered a series of estimates with significantly different MLEs, whereas convergence to a significantly higher mean marginal log likelihood using MrBayes appeared to be unproblematic. Indeed, Hackett et al. note that 80–90% of their fast ML searches converged to solutions with significantly different MLEs!! Moreover, the best of their fast ML searches (apparently based on a partitioned analysis using RAxML) resulted in a phylogeny with a log likelihood of -866,017.07, which is ~5,000–6,500 log likelihood units worse than the 'unreliable' plateaus in the time series plots of the marginal log likelihoods estimated with MrBayes!! Clearly, there are no easy solutions to these hard problems...