The DPP, which is sometimes referred to as the "Chinese restaurant" process, is a nonparametric Bayesian approach used in clustering problems where data elements are assigned to a set of discrete clusters. Importantly, under the DPP, both the number of clusters and the assignment of data elements to clusters are treated as random variables. Phylogenetic applications of the DPP include detecting positive selection in protein-coding DNA sequences (by assigning nucleotide sites to various dN/dS classes: Huelsenbeck et al., 2006) and accommodating among-site variation in substitution rates (by assigning nucleotide sites to various substitution-rate classes: Huelsenbeck & Suchard, 2007).
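To make the restaurant metaphor concrete, here is a minimal Python sketch of drawing cluster assignments from a Chinese restaurant process. It only illustrates the prior and is not the implementation used in any of the studies mentioned here. The function name `sample_crp_assignments` and the concentration parameter `alpha` (which controls how readily new clusters open) are assumptions introduced for this example.

```python
import random


def sample_crp_assignments(num_sites, alpha, rng=None):
    """Draw one clustering of `num_sites` items from a Chinese restaurant process.

    `alpha` is the concentration parameter (assumed > 0): larger values
    make new clusters more likely. Returns (assignments, cluster_sizes).
    """
    rng = rng or random.Random()
    assignments = []    # assignments[i] is the cluster index of site i
    cluster_sizes = []  # cluster_sizes[k] is the number of sites in cluster k

    for _ in range(num_sites):
        # A site joins existing cluster k with weight cluster_sizes[k]
        # and opens a new cluster with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)   # open a new cluster
        else:
            cluster_sizes[k] += 1     # join an existing cluster
        assignments.append(k)

    return assignments, cluster_sizes


if __name__ == "__main__":
    # Hypothetical example: 1000 sites, alpha = 1.0, fixed seed for repeatability.
    assignments, sizes = sample_crp_assignments(1000, 1.0, random.Random(42))
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

Running it a few times with different seeds shows the point of the prior: neither the number of clusters nor which sites share a cluster is fixed in advance. In a real analysis these assignments would be sampled jointly with the other model parameters rather than drawn from the prior alone.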
Accommodating among-site substitution rate variation (ASRV) was the subject of the talks by Oaks and Linkem, who presented results on the ability of the DPP approach to capture ASRV using both empirical and simulated data sets. Their results demonstrate that accommodating ASRV under the DPP can significantly improve marginal likelihood scores relative to conventional methods that assign data partitions to substitution-rate classes a priori (for example, by codon position). In some cases, estimates under the DPP surpassed those of conventional approaches by hundreds or thousands of log-likelihood units! The studies by Oaks and Linkem also confirmed the well-known result that more adequately capturing ASRV can lead to substantially different topologies.
Given these results, why aren't more people using the DPP? One issue emphasized by Oaks and Linkem is the substantial computational burden associated with inference under the DPP. For example, their analyses of two ~30-taxon data sets required more than two months of run time! Computational expense notwithstanding, John Huelsenbeck (and affiliated phylo-geeks at Berkeley... Yeah, I'm looking at you, Brian Moore) is working to provide faster, more user-friendly implementations of DPP-based methods that are bound to change our lives forever. Stay tuned for the latest...