Saturday, October 18, 2008

R Tip: Labeling Trees w/ Posterior Probability and Bootstrap Support

What's the worst part about making figures for phylogeny papers? When I sat down yesterday to make some new figures I realized that my answer was the same as it was ten years ago: adding support values to each node by hand using programs like Adobe Illustrator. I decided I was never going to go through this archaic error-prone process again. In sharing my solution here, I hope that none of you will either! The text below is an annottated R script that does most of the hard work for you (you'll still have to tweak a few things in Illustrator). Of course, you'll need to get familiar with basic R functions for phylogentics before attempting to use these scripts. The best way to do this is to pick up a copy of Paradis' nice little introductory textabe/half/amzn/bn. Don't be lazy and tell yourself you can label your nodes by hand quicker than you could learn R. This might be true if you only had to label one tree, but what are you going to do for your next paper, or when the reviewers advise you to re-run your analyses? If you know how to use the scripts below you'll just be able to plug any tree in and have your labeled tree almost instantly!

#Fist we need to open some necessary libraries
library(ape)
library(geiger)
#The getAllSubTrees function below is a necessary subfunction that atomizes a tree into each individual subclade and was provided compliments of Luke Harmon.
getAllSubtrees<-function(phy, minSize=2) {
res<-list()
count=1
ntip<-length(phy$tip.label)
for(i in 1:phy$Nnode) {
l<-node.leaves(phy, ntip+i)
bt<-match(phy$tip.label, l)
if(sum(is.na(bt))==0) {
st<-phy
} else st<-drop.tip(phy, phy$tip.label[is.na(bt)])
if(length(st$tip.label)>=minSize) {
res[[count]]<-st
count<-count+1
}}res}

#The plotBayesBoot function below plots both posterior probability and bootstrap values on each node of the consensus tree obtained from your Bayesian analysis. Bootstrap values will appear in bold text immediately below and to the left of the node they support, whereas Bayesian posterior probabilies will appear in regular face above and to the left of the node.

plotBayesBoot <- function(bayesTree,bootTree) {
getAllSubtrees(bayesTree)->bayesSub
getAllSubtrees(bootTree)->bootSub
bootList<-matrix("<50",nnode(bayestree),1)
#The commands below compare all the subclades in the Bayes tree to all the subclades in the bootstrap tree, and vice versa, and identifies all those clades that are identical.
for(i in 1:Nnode(bayesTree)) {
for(j in 1:Nnode(bootTree)) {
match(bayesSub[[i]]$tip.label[order(bayesSub[[i]]$tip.label)], bootSub[[j]]$tip.label[order(bootSub[[j]]$tip.label)])->shared
match(bootSub[[j]]$tip.label[order(bootSub[[j]]$tip.label)], bayesSub[[i]]$tip.label[order(bayesSub[[i]]$tip.label)])->shared2
if(sum(is.na(c(shared,shared2)))==0) {
bootTree$node.label[j]->bootList[i]
}}}
plot(bayesTree, cex=1, lwd=0.5) #Plots your Bayesian consensus tree
nodelabels(bayesTree$node.label, adj=c(1.2, -0.3), frame="n", cex=1, font=1) #Adds posterior probability values to the tree. Change the 'cex' value to make the fond smaller or larger. A value of 1 will give you a readable result in the R quartz window, but a value closer to 0.25 might be better for publication)
nodelabels(bootList, adj=c(1.4, 1.3), frame="n", cex=1, font=2) #Adds bootstrap values.
}


#Now you're ready to read your trees in and make your figure!
read.nexus("yourBayesTree.con")->bayesTree #Reads in the .con file that results from analyses in MrBayes.
bayesTree[[1]]->bayesTree #Extracts one of the two trees in the .con file.
read.nexus("yourBootTree.nex")->bootTree #Reads in the consensus tree from a bootstrap analysis in PAUP.
plotBayesBoot(bayesTree, bootTree)

9 comments:

Todd Jackman said...

Thanks, Rich - this (you giving detailed instructions) is a great way to slowly learn R without having to read the giant R book much.
Can you rotate branches or ladderize?
This is important for showing multiple similar trees or for getting the trees to match geography.
Also, if the Bayes consensus has polytomies, I would prefer showing the Bayes tree with best ML score, but that can be done easily.

Luke J. Harmon said...

There is an r function in ape, ladderize(), that will ladderize a tree for plotting. If you run your tree through that before running Rich's script, I think it will do what you want.

Simon said...

Have you tried FigTree? It's quite clever as it'll let you plot anything stored as a comment in the Newick description on the tree nodes/edges. So, you give it something like this:

(taxa1[bootstrap=100&pp=0.99],taxa2[bootstrap=85&pp=0.88])...etc

..thenyou can choose what/how to label the nodes. BEAST automatically stuffs every single parameter into the tree description for FigTree to use, but it's fairly trivial to reverse engineer other software to do the same.


--Simon

Jeet Sukumaran said...

Hi guys,

I wrote SumTrees specifically for this purpose.

http://jeetworks.org/programs/sumtrees

Takes one or more tree files as input, and a target tree file, and plots split support based on the former onto to the latter.

It also has some other options such as skipping a burn-in, composing a consensus tree if a target tree is not given, etc.

Glor said...

Jeet - Thanks for pointing out your program. I have yet to give it a try, but it sounds very useful. It will be nice to able to plot from the original data rather than from the consensus trees, as I've done with the R script. It sounds like it will also make it easier to do what Todd has suggested and plot support values on the tree with the highest ML score.

Jeet Sukumaran said...

Rich:

Yep, in fact the original motivation was to do exactly what Todd requested. But then I later decided that I sometimes might be interested in summing clade posterior probabilities over multiple tree files as well, but *not* mapping it onto an ML tree (i.e., I wanted clade support but wanted to integrate out topology). So I added the burn-in and consensus tree option. When I get around to the next version release, I'd like to add an option to summarize the posterior probabilities of node ages (=tip-to-node distance) as opposed to just edge lengths as it does now.

By the way, great site---it's definitely one of my regular/daily ports-of-call now!

-- jeet

Glor said...

Although I have yet to try this approach, Casey Dunn suggested that Phyutility - the application he wrote with Stephen Smith - will also perform the type of analyses implemented by Jeet's SumTrees program

Nate U said...

Hi Rich-- does this function still work for you, or have you since updated it? I just tried on R v3.0.1 and am getting a series of errors. Seems like a useful tool so I'd like to give it a try!
Thanks

Carol VERHEECKE said...

use Mega6 !