Thursday, July 17, 2008

A Very Non-parsimonious Parsimony Model

There's a nifty paper in the June issue of Systematic Biology by Huelselbeck et al. The paper has a nifty title: "A Bayesian Perspective on a Non-parsimonious Parsimony Model." Basically, the authors implement a fully Bayesian equivalent to parsimony based on Tuffley and Steel's (1997) likelihood model. This "no common mechanism" model requires separate estimates of branch lengths for each branch and each site; adding one base pair to your sequence requires estimating 2n-3 new parameters for your model of sequence evolution, where n is the number of species in the tree. That's a lot of parameters; hence the title.

We knew that in a likelihood framework, the no common mechanism model's ML tree was the same as the maximum parsimony tree. It is not really surprising, given the commonalities between likelihood and Bayesian frameworks, that this equivalence holds for Huelsenbeck et al.'s new implementation (see their figure 5, above, showing a perfect negative relationship between log-likelihood and parsimony score). Still, there are a number of very interesting tidbits in this paper. First, one can now compare the "fit" of a parsimony model to other, more commonly used, Bayesian models. Second, this framework provides natural measures of support for a parsimony analysis, rather than approximations like bootstrap values. The catch here is that the paper itself provides compelling arguments against even implementing this model:

The no-common mechanism model is very peculiar, and the authors have mixed feelings about having implemented the method in a Bayesian framework. (Huelsenbeck et al., p. 415)

There's also a couple nifty new tree-searching "tricks" that the authors implement to help search tree space more efficiently under this complex model. These two (Gibbs-like TBR move and Gibbs eraser) may affect your life, some day, by making your tree searches run faster.

30 comments:

Dan said...

The idea of using the NCM model in model comparisons appeals to me. It's a safe bet that the outcome of such comparisons is pretty much always going to favor a non-NCM model when using any metric that takes number of parameters into account, but it's still pretty neat to be able to do it. I wonder if you would even infer NCM as the appropriate model when it's the true model underlying character evolution?

Salva said...

"Second, this framework provides natural measures of support for a parsimony analysis, rather than approximations like bootstrap values."

This is false. Actually the Bayesian framework can not provide any measure of support for parsimony (or even maximum likelihood), because the use of majority rule trees can bias frequencies for unstable topologies! (as is showed in [1] and [2]).

[1] Sharkey, M.J., Leathers, J.W. 2001. Cladistics 17: 282-284.
[2] Goloboff, P.A., Pol, D. 2005. In: Albert, V.A. Ed. Parsimony, phylogeny, and genomics. Oxford univ., Oxford, pp. 148-159.

Luke J. Harmon said...

Thanks for the comment. Are you arguing against Bayesian support values in general, or specifically in this context?

Glor said...

This was a general criticism from a Cladistic perspective. Anyone who accepts Salva's criticism would reject both Bayesian posterior probabilities and bootstrap values as valid measures of tree support. I just read the original Sharkey & Leathers paper, which is three pages long and uses a single example to make an obvious point. It's main complaint derives from teh observation that majority rule trees recover nodes that aren't found in all of the 'fundamental cladograms.' This shouldn't come as a surprise to anyone - this is the point of majority rule. The suggestion that majority rule "equates reliability with ambiguity" is speculative and entirely unconvincing. I'll be interested to hear what others think, but I found this to be another example of Cladists trying to throw the baby out with the bathwater due to preconceived methodological biases.

Dan Warren said...

I just read the Sharkey and Leathers paper, and I don't think it actually applies to the consensus trees that are built from Bayesian MCMC runs. The discussion there focuses on building consensus trees from sets of equally parsimonious trees (or those of equal likelihood). The reason that there are multiple trees in those cases is because there is ambiguity in the support for different topologies that is present in the data, and their point is that a consensus tree on that set of trees mistakes ambiguity for reliability.

As I see it, that logic doesn't work when you come to the construction of consensus trees in Bayesian analyses. In the course of an MCMC run, a single topology can occur many times, more or less in proportion to their true posterior probability (ideally). This doesn't happen in the sets of trees discussed in the Sharkey and Leathers paper. In taking the consensus from an MCMC run, topologies (and clades) are allowed to "vote" based on how much support there is for them (reliability) rather than each one only being allowed to "vote" once. Taking the consensus of that set of trees is not mistaking ambiguity for reliability, because the trees should be present in proportion to the amount of support for them in the data, rather than each one only being present once.

Salva said...

Luke, the argument is only for Bayesian support, and Dan, I think it is compelling. Think in a case of a 'floating' taxon, for example in the topology (a (b (c (d e)))), with x floating in any position, the frequency of the group (d e) is higher than (c (d e)) because there are only two trees with contradict (d e), but there are 4 than contradict (c (d e)), the both clades appears with different support, and removing the taxa x, both clades show the “correct” measure (I.e. are equally supported) [ play yourself with that kind of examples ;) ].

Contrary to the commentary of Glor, other support measures (like bootstrap, or jacknife), can be protected from this problem if you collapse the unsupported branches after each replication. I guess that the same can be doing for each generation in a Bayesian analysis, so removing the potential problems.

I think that annealing statistics can be a rich camp, specially for big data sets, in which only partial answers can be estimated. But it is important to protect it from over-resolution!

Have a nice weekend :D!

Luke J. Harmon said...

Interesting example, Salva. Here's my rejoinder:

I would think that the support values are context-dependent. This means that including different taxa changes support values for clades, and that is meaningful - that is, it's a feature, not a bug. In your example, the support for the clade (d e) should be interpreted as the evidence for grouping these two taxa together excluding all the other taxa in the analysis. Given that, it makes good sense that this value depends on whether or not taxon x is included in the analysis. Here, if x is "floating" (this term seems a little bit unclear), there's a higher chance it's somewhere in (c (d e)) than in (d e), so including it alters the support values accordingly. In fact, I don't know if we want the method to behave differently than that.

Glor said...

Thanks Salva, that helps my understanding this your critique a bit more. Your argument represents an interesting pathology, but I still don't believe it's one that warrants a revision of existing methodologies. First, I have never observed the phenomenon you describe involving a 'floating' taxon (which I presume involves random placement of a particular OTU at all possible positions in the topology). This would obviously present some serious challenges to phylogenetic inference, but seems unlikely with a data-rich molecular analysis. I've never seen such randomness in any of my own analyses.

Second, you can't rescue Bayesian trees in the manner you've suggested because they represent individual samples from a shared posterior distribution. The only way accommodate your criticism would be to make a strict consensus from this distribution. While this might be appropriate in some cases, I don't think the problem you've mentioned should be a main motivation for doing this.

Dan Warren said...

I agree with Luke that this doesn't actually seem like a flaw of the method, it seems to accurately represent the uncertainty present based on the topology of the remainder of the tree.

As for the idea of rescuing Bayes by doing a strict consensus, I can't think of any conditions under which it makes sense to take the strict consensus of an MCMC run. The idealized MCMC run (i.e., one that runs for an infinite amount of time) would be one in which every possible tree, no matter how bad, is visited in proportion to its posterior probability. Every tree has a non-zero posterior, so under those idealized conditions every tree would be present at least once in the chain and a strict consensus tree would necessarily be a star tree.

Given that the method produces zero resolution in the idealized case, I can't see how it could possibly be defensible in real, non-ideal scenarios. One of the big pitfalls of MCMC is that lack of convergence is "rewarded" with high support values at nodes, and taking strict rather than majority rule consensus seems like a "solution" that would only exacerbate the problem.

liam.revell said...

Nice comment, Dan.

Salva's point about bootstraps vs. posterior probabilities is interesting.
See Suzuki et al. (2002; PNAS 99:16138–16143). In that paper, the authors generated sequence on four taxon topologies: ((a,b),(c,d)), ((a,c),(b,d)), and ((a,d),(b,c)); concatenated the sequences; and found that Bayesian posterior probabilities identified one of the four topologies with high confidence. The bootstrap distribution of topologies contained each topology in approximately equal frequency.

Is the counter argument to Suzuki et al. that concatenating data from different histories violates the assumption of one true tree? (As would the "floating" taxon described by Salva - in fact one way to generate the three topologies used by Suzuki et al. is to fix the single unrooted topology of any three taxa, and then "float" the fourth taxon to any of the three branches of the tree.) If so, it's not surprising that the P-values calculated from the method are biased when model assumptions are violated (as they are in many forms of statistical inference, e.g., the t-test(!), under conditions in which the assumptions of the method are violated by the data).

Susan Perkins said...

Some of this sounds a lot like the real data situation I've been struggling with. I have a set of about 25 taxa (of malaria parasites) and have sequences from the three mitochondrial protein coding genes. I've analyzed these with parsimony, ML (via RaxML) and Bayesian methods, separately and concatenated (with models estimated separately for the latter two) and I essentially get 11 different topologies for the 12 analyses. The clades that are not stable of course have low support, but the concatenated dataset is not the best - one of the three genes is the best (judged on support values). It's all very taxon-dependent as well, leading me to think that there's possibly a bunch of homoplasy in the dataset (however analyzing the amino acids don't solve the instability, either). Not really sure what to make of all this - any suggestions are welcome! Heading in to the office to wrestle with this some more - I'll have to check out the Suzuki paper for the discussion - thanks, Liam.

liam.revell said...

Susan, I'm not sure how germane Suzuki et al. is to your particular problem, since your data (if we disregard the possibility of mtDNA recombination) come from a single history.

However I also think (and I stand to be corrected, here) that we should not necessarily expect more sites to increase the posterior probability of a particular phylogenetic arrangement (even if the sites have evolved under the same history).

Consider the worst possible case in which the rate of evolution at a particular site is extremely high. In this situation, the data at the tips will be random. Random data patterns will tend to have low probability on the true tree and other highly probable trees. Since the probabilities of each sites' patterns are multiplied over all sites to obtain the probability of any particular tree with branch lengths, additional sites with very high rate (noise) will tend to flatten the posterior distribution and cause more different trees to be sampled (as trees are sampled according to their probabilities).

Of course, this problem should be alleviated somewhat by incorporating rate heterogeneity into our model, as a "random" site pattern (with respect to the tree) is more probable under a model in which some sites have a high rate. (Of course, the probability of any random site pattern as the rate, r->Inf, on any tree should be P = 0.25^n for equal base frequencies & n taxa - so quite small for large n.)

Since bootstraps and posterior probabilities are of interest - does anyone know if the effect on posterior probabilities and bootstraps of adding random or invariant sites to an alignment has been studied?

Just a guy who likes systematics said...

From my understanding, the more reasonable "concerns" presented by Goloboff and Pol (based on a talk from 3-5 years ago, not the paper) had to deal with the effects of missing data in a Bayesian framework. The effect, if I recall correctly, was that the taxon with a lot of missing data fell into the middle of the tree (because of the effects of priors and clade size) with high support. In this situation, it would not be a more natural measure of support, and Salva's point would be relevant. From my experience, I have never knowingly witnessed this result in my analyses, but if people start re-running "historic" parsimony analyses that have a lot of missing data, then we might start to see some spurious results...

Glor said...

Just two quick points. First to Dan, of course I was referring to a strict consensus from the post-burnin distribution, which we do not expect to represent a sample of all possible topologies. Not that I think anyone should be doing this either...

Second, the point made by the Suzuki paper (& Liam) about well-supported Bayesian trees from concatenation of discordant gene trees is a one of the main justification being to encourage species tree analyses we've discussed previously (1 & 2).

Luke J. Harmon said...

Liam, your comment made me wonder about trying a "complete garbage" test on really nasty data sets. One might calculate the likelihood of that model as (1/4)^n, where n is the number of sites - or, more simply, the log likelihood as n ln (1/4). If your ln-likelihood is anywhere near that (say, within 10), you're in the "garbage zone."

Dan Warren said...

In response to Rich - I figured that that was you meant, but I would still be leery of a strict consensus tree. The only way a clade would appear in a strict consensus is if you weren't exploring alternative hypotheses about the taxa in that clade. There are two possible reasons why this might be so - the more desirable one is that other trees that did not contain that clade were all so much lower in PP that moves to those trees were never accepted. The less desirable possibility is that the area of treespace that does not contain that clade wasn't adequately explored. I think I'd be less concerned about the latter possibility with small clades, but given my general concerns about failure to diagnose a lack of convergence in MCMC, I'd honestly be a little suspicious of any decent-sized clade I saw in a strict consensus tree. Maybe I'm overly cautious about that, though.

Even so, the issue cuts both ways - part of how MCMC works is that you're occasionally supposed to be including some fairly crappy trees, at a frequency that decreases with how crappy they are. In a strict consensus, you will occasionally lose support for a clade that actually has a very high posterior just due to dumb luck - you happened to sample from the chain right when it took a big step down that it might have immediately come back from.

Dan Warren said...

Oh, and to be clear - I know you weren't suggesting this as a real solution, Rich. I'm just nerding around with the idea.

Luke J. Harmon said...

That garbage test should be L = (1/4)^(ns*nt), where ns is the number of sites and nt the number of taxa.

lnL = ns*nt*log(1/4)

liam.revell said...

In response to Rich and Dan:

Please correct me if I'm wrong, but my impression is that all regions of tree space should be visited at a frequency proportional to their posterior probability (that should be how Monte Carlo integration works).

Since all trees have some probability (see Luke's "garbage" likelihood for the expected probability of random data on an arbitrary tree - the point being, it's always non-zero), if the Markov chain is run for infinite time, all possible trees will be in the sample. Calculating a strict consensus tree would thus punish the user for running the chain for "too long" (because so doing allows improbable trees to be sampled, even if rarely)!

Does this make sense?

liam.revell said...

P.S. I thought that was the point of Dan's original post beginning "As for the idea of rescuing Bayes by doing a strict consensus".

Dan Warren said...

That's what I was meaning to say, yes. The use of a strict consensus tree would basically rely on only exploring a small section of tree space. It's probably not a huge problem to do this in a case where there's a single optimum, the MCMC run is short, burnin is excluded, and the chain is well-behaved - not that it would necessarily be a GOOD thing to do, but it would likely produce a topology that sort of approximated the right one. In any other case, though, it has the potential to throw out good clades just by exploring suboptimal areas of treespace or include bad clades due to convergence problems.

Salva said...

I'm not thinking on a strict consensus of all found trees. Rather I think that in each sampled generation, a local search can be performed to ensure that the sampled tree is in a local optimum, then the strict consensus of the actual generation, is the consensus of that 'tree island'. A quick way to doing this, is using TBR to collapse nodes [1].

Then when you made the final majority rule tree, you can be sure that each represented clade is present in most of the local optima (then you got a good idea about the support for the clade). Jacknife, bootstrap and symmetric resamplic are implemented in this way in TNT.

[1] Goloboff, P.A., Farris, J.S. 2001. Cladistics 17: S26-S34.

Dan Warren said...

I see what you're saying, but I still don't think it's a good idea. First off, there is absolutely no reason why you should want each generation to be on a local optimum - that's just not how MCMC works, nor how it should work.

The outcome of the scheme you outline would be (if I'm understanding correctly) an approximation of the consensus tree you'd get from doing a whole bunch of heuristic searches from random starting points, and then taking the majority rule tree of the strict consensus trees from each of those optima. Nothing inherently wrong with that, but the node support values will not be Bayesian posteriors, and depending on how you do it they may not even be very meaningful. Do you weight each strict consensus tree by the number of trees that go into it? By its likelihood? By some function of the two? The last approach might allow you to get something close to Bayesian posteriors, but as far as I can see it would just be a rough, and quite possibly bad, approximation of the posterior values you get from doing a majority rule tree from the whole set (as is the common practice). Basically you'd sacrifice a lot of resolution and produce support values that aren't necessarily interpretable as either Bayesian posteriors or bootstrap values, but are instead some weird third option that doesn't have a clear statistical justification.

liam.revell said...

Dan, nice comment. Please correct my comment below if I'm wrong.

I think the confusion stems from a (understandable) misconception about what's going on when the posterior distribution of trees is being sampled in a Bayesian analysis.

Unlike in a likelihood (or MP) analysis, the single tree with the highest probability (or shortest length) is not being sought. Otherwise we'd retain a single tree from our sample of the posterior distribution (the one with the highest probability).

Instead, what we're looking for is an estimate of the integral of the function defined by TREE x P(TREE|DATA), in which P(TREE|DATA) is the posterior probability of the tree given the data, with all branch lengths and other parameters thrown into TREE.

The integral of TREE x P(TREE|DATA) and the most probable tree will only be the same if P is symmetric about its maximum (think about the mean & mode in symmetric and asymmetric distributions). Of course, they are probably usually very close (since the most probable tree is usually the ML tree).

On the subject, do we use a "50% majority-rule" consensus criterion? If so, why? Why not 95%? If we had a sample of numeric values, some model for their distribution (say, Gaussian with known variance), and we wanted to use Monte Carlo integration to estimate the population mean and 95% credible interval for that mean wouldn't the limits defined by 95% of our sample define the 95% probable interval for that parameter? Why not the same with phylogenies? Someone smarter please answer this.

Dan Warren said...

Thanks Liam, you said it far more clearly than I did. You're definitely right about there being a lot of confusion about this point as well. I'm frequently surprised by how many people don't get the distinction between the goals of a heuristic search and those of MCMC.

As for your other question, I'm not 100% sure I understand - are you asking whether the 95% CI of the mean should be the central interval containing 95% of the data points? That may be the case with some distributions, but not all of them. I'm not really a math stats person, but I'm an expert at hand-waving so here goes. Somebody correct me if I'm making a fool of myself here.

Let's take the example a uniform distribution from 0-9. It would have a mean of 4.5, and the central interval containing 95% of the values would be 0.25 to 8.75. Let's say for the sake of argument that you got lucky and drew out a sample containing each of the integers from 0 to 9. If you were to do MC integration looking for the mean from that sample, you'd visit estimates of the mean in proportion to their ability to explain your data. How probable is that set of samples if the mean is 4.5? Pretty darn high. If it's 4.3? Still not too bad, but less probable than the true mean. If it's 1.2? Very unlikely. You can immediately see that, even though the distribution that the data was drawn from is uniform, the distribution of the mean isn't. When we do MC integration for the mean of the distribution, we're sampling from the distribution of probabilities of means (non-uniform), not the distribution of the original sample population (uniform).

Dan Warren said...

Oops, I forgot to get to the point of that example. The point is that you are mostly going to be visiting estimates of the mean that are close to the true value far more often than those that are far away. For that reason, the 95% CI of the mean will be much smaller than the central 95% interval of your data.

Glor said...

Liam and Dan - I'm on the same page. My bottom line at this point is that I don't think the original criticism is something we should be worrying about, or using as a justification to stop using Bayesian posteriors.

Dan Warren said...

Oh, for sure. We're now trying to settle the question of how hard and how frequently you have to flog the horse to make it look alive.

liam.revell said...

Dan, that's not what I meant. In your example, we'd very rarely (if ever) visit mu=1.2 (depending on the model, of course), so it would not be in 95% of our sample from the posterior - even though the observation "1" is in the central 95% of our sample, as you point out.

I think there is still some question about what we should do with our sample of trees from the posterior, especially what we should do about branch lengths. . . .

Dan Warren said...

Ah, okay. Sorry for the confusion.