The DPP (Dirichlet process prior), often described via the "Chinese restaurant process" metaphor, is a nonparametric Bayesian approach used in clustering problems where data elements are assigned to a set of discrete clusters. Importantly, under the DPP, both the number of clusters and the assignment of data elements to clusters are treated as random variables. Phylogenetic applications of the DPP include detecting positive selection in protein-coding DNA sequences (by assigning nucleotide sites to various dN/dS classes: Huelsenbeck et al., 2006) and accommodating among-site variation in substitution rates (by assigning nucleotide sites to various substitution-rate classes: Huelsenbeck & Suchard, 2007).
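To make the "random number of clusters" point concrete, here is a minimal sketch of the Chinese restaurant process in Python (a toy simulation, not any published implementation): each new site joins an existing rate class with probability proportional to that class's current size, or opens a new class with probability proportional to a concentration parameter alpha.

```python
import random

def crp_assign(n_items, alpha, seed=0):
    """Assign n_items to clusters via the Chinese restaurant process.

    Under a Dirichlet process prior with concentration alpha, item i
    joins an existing cluster with probability proportional to that
    cluster's size, or starts a new cluster with probability
    proportional to alpha.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of items currently in cluster k
    assignments = []   # assignments[i] = cluster index of item i
    for _ in range(n_items):
        weights = counts + [alpha]  # last slot = "open a new cluster"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)        # new cluster created
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

# The number of clusters is itself a random variable, not fixed a priori:
sites = crp_assign(100, alpha=1.0)
print(len(set(sites)))  # number of rate classes drawn for these 100 sites
```

Larger alpha favors more, smaller classes; the expected number of classes grows only logarithmically with the number of sites.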
This latter application was the subject of the talks by Oaks and Linkem, who presented results on the ability of the DPP approach to accommodate among-site substitution rate variation (ASRV) using both empirical and simulated data sets. Their results demonstrate that accommodating ASRV under the DPP can significantly improve marginal likelihood scores relative to conventional methods that assign data partitions to substitution rate classes a priori (such as those based on codon position). In some cases, estimates under the DPP surpassed those of conventional approaches by hundreds or thousands of log-likelihood units! The studies by Oaks and Linkem also confirmed the well-known point that more adequately capturing ASRV can lead to substantially different topologies.
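For a sense of scale: model comparison here is by Bayes factor, the ratio of marginal likelihoods, and on the usual Kass & Raftery scale a value of 2·ln(BF) above 10 is already "decisive". The arithmetic below uses invented marginal log-likelihood values purely for illustration, not numbers from the Oaks and Linkem talks:

```python
# Hypothetical marginal log-likelihoods (natural log), invented for illustration:
lnL_dpp = -24510.0    # model with DPP-assigned rate classes (assumed value)
lnL_fixed = -24890.0  # a priori codon-position partitioning (assumed value)

# Log Bayes factor in favor of the DPP model:
ln_bayes_factor = lnL_dpp - lnL_fixed
print(ln_bayes_factor)      # 380.0

# 2*ln(BF) > 10 is conventionally "decisive" support, so a gap of
# hundreds of log units is overwhelming by that standard.
print(2 * ln_bayes_factor)  # 760.0
```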
Given that this is the case, why aren't more people using the DPP? One issue emphasized by Oaks and Linkem relates to the substantial computational burden associated with inference under the DPP. For example, their analysis of two ~30-taxon data sets required more than two months of run time! Computational expense notwithstanding, John Huelsenbeck (and affiliated phylo-geeks at Berkeley...Yeah, I'm looking at you Brian Moore) are working to provide faster, more user-friendly implementations of DPP-based methods that are bound to change our lives forever. Stay tuned for the latest...
16 comments:
Thanks to Brian Moore for improving this post by clarifying the misleading parallel I drew between DPP and partitioned analyses when I initially posted last weekend.
You, Rich Glor, have a gift for Photoshop. Jamie is especially hysterical in your version.
funny stuff
So with the simulated data, does the DPP method recover the known topology better than other approaches?
@anonymous
I actually missed Oaks's presentation on the simulated data (my student was presenting concurrently), so I can't enlighten you on how the method performed in this case. Perhaps Jamie or Charles can fill us in? I realize I'm addressing a different question now, but from Linkem's presentation it seemed that topological differences among analyses of the two empirical data sets were more profound in one data set (scincids) than the other (anoles).
@Susan
Three rules for making silly Photoshop images: (1) never work for more than 10 minutes, (2) feather edges when pasting images, (3) rotate heads appropriately when pasting onto a new body.
Thanks for the interesting post, Rich. I think the biggest reason the DPP isn't being used very much yet is that there is no publicly released software that implements it (to my knowledge). The papers that have come out describing how it can be applied all used software developed just for that paper. I'm sure that will change soon, but for the moment...
That's also an important point to keep in mind when looking at run times and convergence rates for analyses using these initial implementations. For instance, in the version of the ASRV DPP software that Oaks and Linkem used, only limited topological moves (NNI) were implemented. Run times will probably always be an issue for DPPs on a large scale, but they should improve substantially over early trials.
I'm really excited about Jamie and Charles' work; my enthusiasm isn't just because Charles has the cojones to work on the biggest phylogenetic nightmare in scincid lizards (the Sphenomorphus group), or because both of them work on non-mammalian amniotes.
Agnostic partition selection, as opposed to a priori partitioning (i.e., partitioning from prior knowledge of the evolution of codon positions and RNA stems/loops), is, in my opinion, the Holy Grail of Phylogenetics*. Unfortunately, we didn't have a way to implement it back in aught-4.
As Jeremy B. pointed out here (and indeed, Charles and Jamie noted in their talks), we are currently limited by the programs that implement the DPP for partitioning strategies. NNI tree proposals just ain't gonna cut it for complex data sets. Soon, I hope, we'll have implementations of the DPP for partitioning strategies with better tree swapping, such as SPR or TBR.
*Matt Brandley wins the award for longest run-on sentence on the Dechron blog??
Also, to Rich G., I'm adding the "yeah, you know me".
Hi,
I just would like to point out that PhyloBayes seems to do exactly what you are praying for using the DPP.
@fdelsuc
Thanks for calling PhyloBayes to our attention. Perhaps you'd be willing to share your experiences with this program? I just skimmed through the manual and couldn't tell how far this program goes beyond the Huelsenbeck program used by Oaks and Linkem in terms of tree reconstruction. Do you know what types of topological moves are permitted (NNI, TBR, etc.)?
Excellent! I can't believe I overlooked PhyloBayes. Hopefully this will be used increasingly in empirical studies.
You may also want to have a look at Blanquart & Lartillot's CAT-BP model implemented in Nh-Phylobayes.
DPP sounds great, if the runtime problem can be resolved.
But if there's a Holy Grail of phylogenetics, I would consider it to be simultaneous alignment and phylogenetic analysis -- what POY tries to do in a limited way. The problem with Holy Grails is that they tend to be computationally intractable.
Someone else's idea of the Holy Grail might be the integration of gene trees and species trees, another computational nightmare.
Maybe there are lots of Holy Grails.
How about this? Simultaneous alignment and tree estimation for large (1000's of taxa+), full-genome datasets, while taking the coalescent into account (thereby estimating the species trees within which the gene trees nest), not requiring me to try to delineate partitions beforehand, and automatically tapping into a massive fossil database to estimate divergence times.
All on my desktop. While I'm having lunch.
But if this existed...what would I do to stay busy? Ah, yes! Comment on more blog posts! ;-)
According to the author, PhyloBayes implements sophisticated SPR branch swapping to move through tree space. Based on my own empirical experience, I would say that this swapping seems to be as efficient as the one MrBayes implements, and it performs even better in some cases. The main difference from MrBayes, besides the DPP and the innovative CAT model, is that PhyloBayes does not use Metropolis coupling of MCMC. It is therefore required to run multiple independent chains in parallel. PhyloBayes performs best when using the CAT+G model on heterogeneous amino acid data, such as complete mitochondrial genomes. It is worth trying on your favourite datasets. Of note, PhyloBayes "cycles" do not correspond to MrBayes "generations". This should be kept in mind when monitoring the convergence of the MCMC.
@fdelsuc
Thanks for the info, I'll have to give this program a spin.