Wednesday 17 April 2013

How to read a phylogenetic tree

This week I have been preparing my last phylogenetics lectures and practical of the year. Something that is quite clear when marking student work is that many students have no idea how to read a phylogenetic tree and identify the key features that aid its interpretation. To help combat this, here is my basic guide to "How to read a phylogenetic tree".

Topology

The first and possibly most important thing about any tree is the topology - the branching order. It's easy to get distracted by the direction or style of the tree (e.g. curved branches versus straight) but none of these things matter for the topology. The key thing here is the path taken from one node to another and how the species (or molecules, but I will just refer to species here for clarity) cluster together. The "direction" of the tree and whether the (terminal) "leaf" nodes are at the top, bottom, left or right is not important. (I'll come back to this below, under "Rooting".) Likewise, the vertical ordering of nodes is not inherently important - whether a particular node appears at the top, bottom or somewhere in the middle of the tree is largely a matter of preference that will depend on the purpose of the tree and the story you are using it to tell.

In fact, it is easier to look at the branches in a tree rather than the nodes, as nodes can be rearranged in a way that can at first appear confusing. These four trees, for example, all have the same topology:
Each branch can be thought of as dividing the tree in two and splitting the species into two groups accordingly. If two trees share a topology, their branches will make the same splits, even if their (in this case) vertical ordering is different. Trace, for example, the path from A to D. This is easiest first in the top left tree. The branch leading directly to A splits the tree into A:BCDE. The next splits AB from CDE. (This branch contains the root, which I will come to below.) Now, moving back out from the root to the tip, we travel along a branch that splits ABE:CD before finally reaching the branch that leads only to D, splitting it from ABCE. Tracing the path from A to D in any of the other three trees will take exactly the same route. Any other tip-to-tip journey will likewise be the same in all four trees.
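The idea of comparing trees by their splits can be sketched in a few lines of Python. This is only an illustration: the nested-tuple tree format and the function names (leaves, splits) are my own assumptions and do not come from any particular phylogenetics package. tree_a is written to match the splits described above for the top left tree, tree_b is the same tree with its nodes rotated, and tree_c is a made-up tree with a genuinely different topology.

```python
def leaves(tree):
    """Return the set of tip names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the set of splits (bipartitions) made by the branches of a tree."""
    all_tips = leaves(tree)
    found = set()

    def visit(node):
        below = leaves(node)
        other = all_tips - below
        if other:
            # Store each split as an unordered pair of tip sets, so the side
            # of the branch we approach it from makes no difference.
            found.add(frozenset([below, other]))
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


# Hypothetical example trees (tips only, no branch lengths).
tree_a = (("A", "B"), ("E", ("C", "D")))
tree_b = ((("D", "C"), "E"), ("B", "A"))   # same topology, nodes rotated
tree_c = (("A", "E"), ("B", ("C", "D")))   # different topology

print(splits(tree_a) == splits(tree_b))
print(splits(tree_a) == splits(tree_c))
```

Note that this comparison ignores where the root sits: the two branches either side of the root make the same split, so two trees that differ only in rooting will still come out as equal here.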

This is particularly important when comparing trees, especially big ones. I have seen people invest a lot of time and effort (and sometimes manuscript space) speculating about the differences between two particular trees when, in fact, they were really the same tree and there was no difference. Alternatively, the topology might be the same and the apparent differences might just be due to where the tree was rooted, which I will return to.

Branch Lengths

Once the topology is clear, the next things to look at are the branch lengths, as these can give key insights into how the tree can be interpreted and, sometimes, even the methods behind the tree. There are two key things to look at in this respect: (1) the distance between (not necessarily connected) internal nodes, shown with the red arrows below, and (2) the root-to-tip distances for each terminal node, shown by the coloured arrows in the figure below:
If the spacing is even (i.e. all the red arrows are the same length) then it is highly likely that branch lengths are not being shown and the tree is only displaying the topology. This can be confirmed by (a) the lack of a scale bar, and (b) internal nodes being pushed towards the tips. (In the left tree, the node joining AB is aligned with that joining CD, not the deeper CDE ancestor.) If the spacing is not even then branch lengths are being shown. These should really be accompanied by a scale bar (although the figure above does not have any).

If branch lengths are being shown, the next thing to look at is the total root-to-tip distance for each terminal node. (The coloured arrows in the figure above.) If these are all the same length, as in the right-hand tree, it is highly likely that a molecular clock has been assumed (if it's a molecular phylogeny). If it hasn't been assumed - and the methods should provide enough details to know - then the molecule in question is just evolving in an incredibly clock-like fashion. More usually, these root-to-tip distances will not all be the same. If the tree is topology-only, as in the left-hand tree, the equal root-to-tip distances do not mean anything and no conclusions about rates can be reached.
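As a rough sketch of the root-to-tip check, here is some Python that sums branch lengths from the root to each tip and tests whether they are all (nearly) equal. The tree format is an assumption made for this example: each node is a pair of (tip name or list of child nodes, length of the branch leading to it), and the branch lengths are invented. The function names are hypothetical too.

```python
import math


def root_to_tip(node, travelled=0.0):
    """Return a dict mapping each tip name to its distance from the root."""
    label_or_children, length = node
    total = travelled + length
    if isinstance(label_or_children, str):
        return {label_or_children: total}
    distances = {}
    for child in label_or_children:
        distances.update(root_to_tip(child, total))
    return distances


def looks_clock_like(node, rel_tol=1e-6):
    """True if every root-to-tip distance is the same, within a tolerance."""
    distances = list(root_to_tip(node).values())
    return all(math.isclose(d, distances[0], rel_tol=rel_tol) for d in distances)


# Made-up tree in which every tip is 0.4 from the root.
clock_tree = ([
    ([("A", 0.1), ("B", 0.1)], 0.3),
    ([("E", 0.3), ([("C", 0.1), ("D", 0.1)], 0.2)], 0.1),
], 0.0)

# The same tree with a much longer branch leading to A.
uneven_tree = ([
    ([("A", 0.5), ("B", 0.1)], 0.3),
    ([("E", 0.3), ([("C", 0.1), ("D", 0.1)], 0.2)], 0.1),
], 0.0)

print(looks_clock_like(clock_tree))
print(looks_clock_like(uneven_tree))
```

A positive result only says the tree is consistent with a molecular clock, and it is only meaningful when real branch lengths are being shown. On a topology-only tree the equal distances are an artefact of the drawing.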

Rooting

Evolutionary trees (almost) always start with an ancestor and then divide, so you can always identify the root (if there is one) as the point where all the branches converge. Historically, it was drawn at the bottom like a real tree (as with the great Molluscan tree in OUMNH and the OneZoom Tree of Life Explorer). These days, it is usually drawn on the left as in these diagrams but I have seen trees with the root at the top, bottom or even on the right. (The latter is usually only used when mirroring another tree.) I have posted before on how to root a phylogenetic tree, so I won't go over that again here. The rooting method should be given in the methods but, when it is missing, you can often guess from the shape of the tree and by using the root-to-tip branch lengths again:
Unrooted trees are pretty obvious when shown in the "radiation" style. If the tree is rooted, it is almost certainly either midpoint rooted or outgroup rooted (see "how to root a phylogenetic tree"). Midpoint rooting can be identified by virtue of the fact that the two longest root-to-tip distances will (a) be the same length and (b) be either side of the root. If either of these conditions is broken, it is not midpoint rooted and is probably outgroup rooted. (Note that if both conditions are met, it is still possible that the tree is outgroup rooted. Indeed, if the evolutionary rates are fairly consistent, outgroup rooting and midpoint rooting should be the same.)
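The midpoint test above can be sketched in Python as follows. For a tree drawn with its root, the two conditions boil down to asking whether the deepest tip on one side of the root is the same distance away as the deepest tip on the other side. As before, the tree format (each node is a pair of tip name or child list, plus the length of the branch leading to it), the function names and the branch lengths are all assumptions made up for illustration, and the sketch assumes the root has at least two children.

```python
import math


def deepest_tip(node):
    """Longest distance from the start of this node's branch to any tip below it."""
    label_or_children, length = node
    if isinstance(label_or_children, str):
        return length
    return length + max(deepest_tip(child) for child in label_or_children)


def consistent_with_midpoint_rooting(root, rel_tol=1e-6):
    """True if the two deepest sides of the root reach equally distant tips."""
    children, _ = root
    depths = sorted((deepest_tip(child) for child in children), reverse=True)
    return math.isclose(depths[0], depths[1], rel_tol=rel_tol)


# Made-up tree: the longest tips (A and D) are both 0.8 from the root
# and sit on opposite sides of it.
midpoint_like = ([
    ([("A", 0.5), ("B", 0.1)], 0.3),
    ([("E", 0.3), ([("C", 0.1), ("D", 0.6)], 0.1)], 0.1),
], 0.0)

# Made-up tree rooted on a short outgroup branch (O) instead.
outgroup_like = ([
    ("O", 0.2),
    ([("A", 0.5), ("B", 0.1), ("C", 0.4)], 0.1),
], 0.0)

print(consistent_with_midpoint_rooting(midpoint_like))
print(consistent_with_midpoint_rooting(outgroup_like))
```

A result of True does not rule out outgroup rooting, for the reason given above. A result of False does rule out midpoint rooting, provided the branch lengths are real.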

Ideally, a rooted tree should have the root marked. Sometimes, however, it is left off, as in the bottom left. This can be confusing as tree visualising programs will often display trees in the "traditional" style even when they are not rooted. This is particularly a problem when branch lengths are not shown, as it will not be at all obvious whether the tree is rooted or not. The time that I see this catch people out most is when making a Maximum Parsimony tree using the popular software, MEGA - these trees are displayed randomly rooted and without branch lengths by default.

Reliability and Confidence Metrics

It is always important to consider how reliable the rooting method used is likely to be if conclusions are being reached regarding the direction of particular evolutionary events. Despite this, it's pretty rare for the root position to have a direct confidence measure associated with it (although I am sure there are ways to do it). What is common, however, is to have confidence metrics for the internal branches, which are usually placed above (or sometimes below) the branch next to the descendant node (in red, below). (Branch lengths, when shown, are normally below and nearer the middle of the branch.)
Bayesian and Maximum Likelihood methods quite often produce branch probabilities as part of the method but otherwise the most common method is "bootstrapping", which is a random sampling method. I will save bootstrapping and branch tests for future posts. My one tip for now: always remember that bootstrap values are associated with branches and not nodes.

Phylogenetic checklist

In summary, my checklist for reading a phylogenetic tree: topology ⇒ branch lengths? ⇒ molecular clock? ⇒ rooting ⇒ branch confidence metrics.

6 comments:

  1. Hi - Thank you for this blog post! I have a question - Is it possible to select an outgroup as the root and still obtain bootstrapping values for the branches? Or must you choose one or the other? Thanks!

    1. Hi Ana. Yes, you can have outgroup (or any other) rooting AND bootstraps - they are independent. Bootstrapping is normally performed on an unrooted tree. (Although it is possible in principle to bootstrap mid-point rooting.)

  2. When a phylogenetic tree is rooted under the assumption of the molecular clock (Bayesian Molecular Clock Rooting), what do the scale bars represent?

  4. The article is very useful!
    However the first 3 pictures were gone.
    Could you please reupload the pictures?
    Thank you so much!

  5. Thank you for posting this great article!
    Although posted some time ago, it will be great if you could make the first three figures visible again.
    Thank you for your efforts.

