Sunday, January 17, 2010

Phylogenetic Model Selection, Sans PAUP*

Given that an increasing share of statistical software in ecology and evolution is essentially free (i.e., open source and/or non-proprietary), I used to be bothered by the lack of suitable alternatives to PAUP* (which requires a licensing fee) for certain phylogenetic applications. Foremost among these is perhaps the ability to perform statistically rigorous phylogenetic model selection. There are now a number of free alternatives for phylogenetic model selection that do not require PAUP* (which is required by the widely used Modeltest and MrModeltest programs). I've probably been living in a bubble, because I just learned of several of these yesterday, but I thought I'd flag a few for Dechronization readers who might find this info useful.

One that I have used extensively is ModelGenerator, from the nice folks at the NUI Maynooth Bioinformatics Group. This is quite useful, not least because it has a web interface that lets you upload batches of fasta-formatted alignments. Because the computation is distributed across many “idle” desktop computers at NUI Maynooth, the processing time is low – I’ve uploaded batches of alignments only to get a results file emailed back to me within ~20 minutes or so. You can also run the program locally on your own machine or email the author for source. Additional options that I have yet to explore include jModeltest, FindModel (another web-based tool), and MrAIC. I'm sure this list is incomplete and welcome comments on programs I've missed as well as strengths and weaknesses of those I've listed.

12 comments:

Mitchell Selfdrive said...

You might be interested in TOPALi: http://www.topali.org/

Jeet Sukumaran said...

Hi Dan,

With the increasing number of multi-locus datasets, for better or worse, I've become quite comfortable with sticking to a reasonably complex character substitution model (e.g., HKY+G, or GTR+G) for each partition, while focussing more on worrying about the higher-level partitioning of the data. We have had mixture models to deal with this for a while now, as well as, of course, the famous DPP-prior approach to integrating out uncertainty in this aspect of model structure, but the computational demands of these approaches have made practical applications in empirical analysis quite challenging.

I've been toying with an idea of developing an application that will do model selection on a higher-level to identify an optimum partitioning scheme: given a dataset and user-specified candidate subsets of the data constituting the blocks at the finest resolution, the application would try various combinations of blocks to statistically select a partitioning scheme.

Do you know of anything that does this? If not, once I clear my plate of the Ginkgo phylogeographic simulator project, I might give it a try.
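
The search described above can be sketched in a few lines. This is only a minimal illustration, assuming the number of user-specified blocks is small enough to enumerate every grouping exhaustively. The names `set_partitions` and `best_scheme` are hypothetical, and the scoring function is left to the caller: a real application would fit a substitution model to each subset and return something like an AIC or BIC score (lower is better).

```python
from typing import Callable, Iterator, List, Sequence

Scheme = List[List[str]]


def set_partitions(blocks: Sequence[str]) -> Iterator[Scheme]:
    """Yield every way of grouping `blocks` into non-empty subsets."""
    if not blocks:
        yield []
        return
    first, rest = blocks[0], blocks[1:]
    for scheme in set_partitions(rest):
        # Add `first` to each existing subset in turn...
        for i in range(len(scheme)):
            yield scheme[:i] + [[first] + scheme[i]] + scheme[i + 1:]
        # ...or give it a subset of its own.
        yield [[first]] + scheme


def best_scheme(blocks: Sequence[str], score: Callable[[Scheme], float]) -> Scheme:
    """Return the scheme with the lowest score (hypothetical helper)."""
    return min(set_partitions(blocks), key=score)


if __name__ == "__main__":
    # Hypothetical block names; a real run would use codon positions, genes, etc.
    blocks = ["gene1_pos1", "gene1_pos2", "gene1_pos3"]
    for scheme in set_partitions(blocks):
        print(scheme)
```

The number of candidate schemes grows very quickly with the number of blocks, so exhaustive enumeration like this is only practical for a handful of blocks; beyond that a heuristic search would be needed.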

Glor said...

The need for PAUP dies a little more each year...

While it may be reasonable to focus only on the complex models like HKY and GTR, I've been dinged by reviewers for not explicitly considering simpler models.

Dan Rabosky said...

Hi Jeet-

I agree with you entirely and have yet to work with a dataset where "rigorous statistical model selection" actually made any difference for the questions of interest, once I'd crossed a reasonably complex model threshold (e.g., HKY+G). Perhaps others have different experiences here, but this is generally true for the questions I work on. Far more important, I think, is ensuring that convergence is actually occurring and that you have some reasonable partitioning scheme. Now, given this, I think Rich is right that, at the present time, it is just too easy for reviewers to make a stink over this issue. I for one would prefer not to take a chance with this, just because it is really frustrating to have to comprehensively reanalyze everything because a reviewer was unhappy with the fact that, for example, you (arbitrarily) assigned a GTR+G+I model to a partition...

fdelsuc said...

Hi,
Treefinder also proposes model selection.
Concerning partitioned models, PUMA allows comparing the fit of some a priori defined partitioning schemes with unpartitioned and mixture models in a Bayesian context, but you have to run the MCMC under each model first...

Jeet Sukumaran said...

Dan,

For the convergence issue, have a look at this post. Looks like a nice approach, that might give some additional insight along with AWTY etc., and also would be really simple to implement using, for example, Python + matplotlib.

Jeremy Brown said...

It's certainly nice to see free alternatives for something as central to phylogenetic studies as model selection. However, I'm inclined to agree with previous comments suggesting that such things as mixture (or partitioned) modeling and convergence checking are likely to be more important than model testing on individual partitions, assuming a reasonably complex model has been used. In particular,

(1) I think we need to be rigorously checking for convergence specifically regarding the parameter of interest (e.g., bipartition posterior probabilities, branch lengths, marginal likelihood, etc.)

(2) The fear of model overfitting should not be as great as the fear of model underfitting. Most simulation studies show very small errors due to overfitting relative to underfitting. For just a couple of examples, see Lemmon and Moriarty and Brown and Lemmon.
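
As a rough illustration of point (1), one simple check on bipartition posterior probabilities is to compare their frequencies between two independent runs. The sketch below assumes each sampled tree has already been reduced to a collection of bipartitions (here, frozensets of taxon names on one side of the split); parsing tree files is not shown. The function names are hypothetical, and this is not the method used by AWTY or any other specific program.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List

Split = FrozenSet[str]


def split_frequencies(trees: List[Iterable[Split]]) -> Dict[Split, float]:
    """Fraction of sampled trees in which each bipartition appears."""
    counts: Counter = Counter()
    for tree in trees:
        counts.update(set(tree))  # count each split once per tree
    return {split: n / len(trees) for split, n in counts.items()}


def max_split_difference(run_a: List[Iterable[Split]],
                         run_b: List[Iterable[Split]]) -> float:
    """Largest absolute difference in split frequency between two runs."""
    freq_a = split_frequencies(run_a)
    freq_b = split_frequencies(run_b)
    all_splits = set(freq_a) | set(freq_b)
    return max(
        (abs(freq_a.get(s, 0.0) - freq_b.get(s, 0.0)) for s in all_splits),
        default=0.0,
    )


if __name__ == "__main__":
    # Toy samples with made-up taxa, purely for illustration.
    ab, ac = frozenset({"A", "B"}), frozenset({"A", "C"})
    run_1 = [[ab], [ab], [ab], [ac]]
    run_2 = [[ab], [ab], [ac], [ac]]
    print(max_split_difference(run_1, run_2))
```

A large value suggests the two runs disagree about the support for at least one bipartition; what counts as "large" is a judgment call that depends on the question being asked.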

John said...

HyPhy (http://www.datam0nk3y.org/hyphy/doku.php) can run an analysis selecting from all 203 reversible rate matrices, but it is very slow.
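
For readers wondering where 203 comes from: it is the number of ways to group the six exchangeability rates of a time-reversible nucleotide model into classes that share a rate, which is the sixth Bell number. A short sketch of that count, using the Bell triangle (the function name is just for illustration):

```python
def bell_number(n: int) -> int:
    """Number of ways to partition a set of n items, via the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


if __name__ == "__main__":
    # Six exchangeability rates: AC, AG, AT, CG, CT, GT.
    print(bell_number(6))  # 203
```

At one extreme all six rates fall in a single class (as in JC or F81), and at the other each rate has its own class (GTR).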

Jeet, I'm not sure if this is what you're after (or if it's even really useful), but a paper by Li et al. (2008) in Syst. Biol. tries to develop a method for optimal data partitioning in protein-coding data by starting from the most complex set of partitions (x genes x 3 codon positions) and using a centroid clustering method to produce a guide tree and pare down the partitioning scheme. The dataset is re-searched each time, and the optimality criterion they use for selecting the partitioning scheme is the BIC. The problem is that it requires SAS.
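
For reference, the BIC (and the closely related AIC) is easy to compute once the maximized log-likelihood of each candidate is in hand. The sketch below is a generic illustration, not the procedure from Li et al.; it assumes the sample size is taken to be the number of alignment sites, which is a common but debatable convention. The log-likelihoods in the example are made-up numbers, and the parameter counts ignore branch lengths for simplicity.

```python
import math
from typing import Dict, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 lnL (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def select_best(candidates: Dict[str, Tuple[float, int]], n_sites: int) -> str:
    """Name of the candidate with the lowest BIC (hypothetical helper)."""
    return min(candidates, key=lambda name: bic(*candidates[name], n_sites))


if __name__ == "__main__":
    # HYPOTHETICAL log-likelihoods, for illustration only.
    # Free substitution parameters: HKY+G = 5, GTR+G = 9.
    candidates = {"HKY+G": (-5230.4, 5), "GTR+G": (-5226.9, 9)}
    for name, (lnl, k) in candidates.items():
        print(name, round(aic(lnl, k), 1), round(bic(lnl, k, 1200), 1))
    print("Lowest BIC:", select_best(candidates, n_sites=1200))
```

Because the BIC penalty grows with the log of the sample size, it tends to favor simpler models or coarser partitioning schemes than the AIC does on long alignments.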

sergios-orestis kolokotronis said...

There's also FitModeL.

Now, regarding model selection, I've been using GTR+G for the past two-three years for nucleotide data, and I test for alternative models only in the case of protein data (or estimate my own GTR AA matrix in PAML and RAxML). If the best-fit model is HKY or TN, then GTR will adapt (i.e. the rate matrix).

An issue to keep in mind is the coestimation of Gamma's alpha shape parameter and the proportion of invariable sites (I or P-Inv). The coexistence of both always bugged me because one would think that when the Gamma distribution accounts for very low rates, null rates would be included (null as in invariable sites). I refer readers to Ziheng Yang's 2006 Computational Molecular Evolution book, chapter 4.3.1.4, pages 113-114.

---
This model [G+I] is somewhat pathological as the gamma distribution with α≤1 already allows for sites with very low rates; as a result, adding a proportion of invariable sites creates a strong correlation between p0 (a.k.a. I or P-Inv) and α, making it impossible to estimate both parameters reliably (Yang 1993; Sullivan et al. 1999; Mayrose et al. 2005).
---

For more on the bimodality of the rate distribution and branch lengths, see the rest of the chapter in that book.
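
The point about low rates can be seen with a quick simulation. This is only a sketch using Python's standard library: it draws site rates from a mean-one gamma distribution (shape α, scale 1/α) and reports the fraction that fall below a small threshold. The threshold of 0.01, the number of draws, and the function name are arbitrary choices for illustration.

```python
import random


def fraction_near_zero(alpha: float, threshold: float = 0.01,
                       n_draws: int = 20000, seed: int = 1) -> float:
    """Fraction of mean-one gamma rates (shape alpha) below `threshold`."""
    rng = random.Random(seed)
    # Shape alpha with scale 1/alpha gives a distribution with mean 1.
    draws = (rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_draws))
    return sum(rate < threshold for rate in draws) / n_draws


if __name__ == "__main__":
    for alpha in (0.2, 0.5, 1.0, 5.0):
        print(alpha, fraction_near_zero(alpha))
```

With small α, a noticeable share of sites already has rates that are effectively zero, which is why an additional invariable-sites class competes with α to explain the same sites.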


PS. What's with Blogger still not allowing HTML tags like blockquotes?

Ruhfel said...

Hi all,

Just wondering... for those of you who are comfortable using a GTR model when say a TVM model is chosen, how have you defended this in a publication? References would be much appreciated. I see many papers that just get away with saying "we tested for models, but RAxML doesn't implement those, so we used GTR + G". However, they never really justify it...

Best,
Brad

Andrés Parada said...

Hi Dan,
ModelGenerator can run (really fast) through a Java .jar executable.


(BTW, I am a newbie, I just discovered this awesome blog)

Rob Lanfear said...

Hi All,

A very late comment on a very dead thread, but thought I'd leave it anyway.

We did (unknowingly of this thread) code up a program to do combined model selection and partitioning scheme selection. It's available here: http://robertlanfear.com/partitionfinder/

In short, it does almost everything modeltest and ProtTest do, but it also finds optimal partitioning schemes. There are some limitations - the user has to specify data blocks (a.k.a. starting partitions). But I'm working on data-driven approaches to proposing new partitions as well.

It's reasonably flexible, and does most of the things people on this thread have suggested (almost precisely what Jeet proposed, in fact). We considered the approach in Li et al., but decided against it; there's an accompanying publication in MBE that describes why.

Cheers,

Rob