Cladistics is a method of deriving possible family trees, or cladograms, from a set of specimens. It works by measuring the states that a selected set of characters take in each of the specimens, and finding the trees in which the fewest transitions occur between different states. (The total number of state transitions in a tree is called its length.)
The process is related to, but not directly bound up with, both the ``cladistic philosophy'' that the only legitimate groupings are clades (see ``So can we say that birds are dinosaurs?'') and the ``cladistic taxonomy'' which eschews Linnaean ranks (see ``What do terms like phylum, order and family mean?'').
A trivial worked example will make this process much clearer.
Suppose we want to work out the relationships between three genera: Apatosaurus, Brachiosaurus and Camarasaurus (which conveniently happen to begin with A, B and C). The chance that any of these is directly descended from any of the others is vanishingly small, so we do not consider trees in which the taxa under consideration appear at the branch points. Accordingly, we have three possible trees, corresponding to which of the three genera was first to branch off the lineage that eventually gave rise to both of the others:
    tree #1       tree #2       tree #3

      B   C         A   C         A   B
       \ /           \ /           \ /
    A   \/        B   \/        C   \/
     \ /           \ /           \ /
      \/            \/            \/
(Only the topology of these trees is significant, not the geometry: so, for example, if B and C were swapped in tree #1, the resulting tree would be equivalent to #1.)
In order to determine which of these candidate trees is the most likely, we need to select a set of characters that we are going to analyse for each taxon, and make the appropriate measurements on our specimens. The results are entered in a matrix.
For this example, we choose the following characters:
This gives us the following matrix:
Taxon | Character 1 | Character 2 | Character 3 | Character 4 |
Apatosaurus | no | yes | no | no |
Brachiosaurus | yes | no | no | yes |
Camarasaurus | yes | yes | yes | yes |
Now consider the three putative trees above. Assuming that the common ancestor of all three taxa lacks all four characters (see below), we find that:
The principle of parsimony says that, other things being equal, we should expect as few transitions as possible - that is, we assume the simplest evolutionary sequence that accounts for the observable facts. So we consider tree #1 as the most likely of the three alternatives presented here - and indeed, this is the tree that most workers consider to be correct, with both Brachiosaurus and Camarasaurus falling within the group called Macronaria, and Apatosaurus outside that group.
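The tree lengths for this example can be checked mechanically. The following is only a minimal Python sketch, not a description of how any real cladistics program works: the 0/1 coding of the matrix (no = 0, yes = 1), the nested-tuple notation for trees, and the names MATRIX, TREES, min_changes and tree_length are all assumptions made for illustration. It fixes the common ancestor at ``no'' for every character, as assumed above, and counts the fewest state transitions that each tree requires.

```python
# Character matrix from the worked example, coded no = 0, yes = 1.
MATRIX = {
    "A": (0, 1, 0, 0),  # Apatosaurus
    "B": (1, 0, 0, 1),  # Brachiosaurus
    "C": (1, 1, 1, 1),  # Camarasaurus
}

# Each tree is (first taxon to branch off, (the other two taxa)).
TREES = {
    1: ("A", ("B", "C")),
    2: ("B", ("A", "C")),
    3: ("C", ("A", "B")),
}


def min_changes(node, state, char):
    """Fewest transitions below `node`, given that `node` itself has `state`."""
    if isinstance(node, str):
        # A tip's state is observed, so any other state is impossible.
        return 0 if MATRIX[node][char] == state else float("inf")
    total = 0
    for child in node:
        # Try both states for the child; a differing state costs one transition.
        total += min(
            min_changes(child, s, char) + (s != state) for s in (0, 1)
        )
    return total


def tree_length(tree):
    """Total transitions over all characters, with the root lacking every character."""
    n_chars = len(next(iter(MATRIX.values())))
    return sum(min_changes(tree, 0, char) for char in range(n_chars))


for number, tree in TREES.items():
    print(f"tree #{number}: length {tree_length(tree)}")
# tree #1: length 5
# tree #2: length 6
# tree #3: length 7
```

Under these assumptions tree #1 is the shortest, in agreement with the conclusion above.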
For this tiny example, we've been able to work out the alternatives ``by hand'', but as we start to consider more taxa, the number of possible trees grows frighteningly fast. Three taxa give rise to three possible trees, and four to fifteen; but as few as seven taxa yield more than ten thousand trees, and ten taxa can be arranged into more than 34 million trees!
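These counts follow from a standard formula: the number of distinct rooted, strictly two-way-branching trees on n labelled taxa is the product of the odd numbers from 1 up to 2n - 3. The short Python sketch below (the function name rooted_tree_count is ours, chosen for illustration) reproduces the figures quoted above.

```python
def rooted_tree_count(n_taxa: int) -> int:
    """Number of rooted, strictly bifurcating trees on n_taxa labelled tips."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (the leading 1 changes nothing).
    for odd in range(3, 2 * n_taxa - 2, 2):
        count *= odd
    return count


for n in (3, 4, 7, 10):
    print(n, rooted_tree_count(n))
# 3 3
# 4 15
# 7 10395
# 10 34459425
```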
It's clear that computers are required to analyse large data-sets; and the availability these days of powerful computers is part of the reason that cladistics is a relatively new approach to systematics. Typically the data matrices - perhaps of dozens of taxa and hundreds of characters - are fed to a computer program such as PAUP (Phylogenetic Analysis Using Parsimony) or MacClade, and a lot of magic goes on behind the scenes: heuristics are required to slice off chunks of the search space, since for large data-sets, the total number of theoretically possible trees is too great even for computers to consider.
In addition to the difficulties presented by this trivial example, more problems arise when doing cladistic analysis on real data sets: for example, due to the incomplete nature of the fossil record, we rarely have all the data: many characters have to be coded as ``don't know'' for some taxa. Programs such as PAUP need to be able to deal with this, as well as multi-state characters such as ``End of tail unspecialised (0), whiplash (1) or club (2)''.
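One common way of handling both complications is Fitch's algorithm for unordered characters, in which a ``don't know'' score is simply treated as compatible with every state. The sketch below is an illustration only, and we do not claim that PAUP does it this way: the five taxa V to Z, their scores and the two candidate trees are invented for the example. Note also that, unlike the worked example above, it does not fix the ancestral state in advance.

```python
TAIL_STATES = frozenset({0, 1, 2})  # 0 unspecialised, 1 whiplash, 2 club

# Hypothetical scores for five made-up taxa; None means "don't know".
SCORES = {"V": 1, "W": 1, "X": 2, "Y": 2, "Z": None}


def fitch(node, scores, states=TAIL_STATES):
    """Return (candidate state set, minimum transitions) for the subtree at node."""
    if isinstance(node, str):
        score = scores[node]
        # An unknown score is compatible with every state, so it never adds cost.
        return (states if score is None else frozenset({score})), 0
    left, right = node
    left_set, left_cost = fitch(left, scores, states)
    right_set, right_cost = fitch(right, scores, states)
    shared = left_set & right_set
    if shared:
        # The two branches can agree on a state: no extra transition needed.
        return shared, left_cost + right_cost
    # No state in common: one transition is unavoidable at this node.
    return left_set | right_set, left_cost + right_cost + 1


grouped = ((("V", "W"), ("X", "Y")), "Z")  # like-tailed taxa grouped together
mixed = ((("V", "X"), ("W", "Y")), "Z")    # like-tailed taxa split apart

print(fitch(grouped, SCORES)[1])  # 1
print(fitch(mixed, SCORES)[1])    # 2
```

The taxon with the unknown score contributes nothing to either length, which is exactly why missing data weakens an analysis without invalidating it.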
Sophisticated cladistic analysis programs should also allow characters to be weighted by importance, so that (for example) when taxa share the same value for a character x, this can be considered twice as likely to imply that they are related as if they shared the less significant character y.
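The simplest form of weighting multiplies the number of transitions in each character by that character's weight before summing. The Python sketch below applies this to the three-sauropod matrix from the worked example, again with the ancestor fixed at ``no'' for every character. The 0/1 coding, the names used and above all the weights themselves are assumptions made purely for illustration; nothing here suggests that character 2 really deserves extra weight.

```python
# Worked-example matrix, coded no = 0, yes = 1.
MATRIX = {
    "A": (0, 1, 0, 0),  # Apatosaurus
    "B": (1, 0, 0, 1),  # Brachiosaurus
    "C": (1, 1, 1, 1),  # Camarasaurus
}
TREES = {
    1: ("A", ("B", "C")),
    2: ("B", ("A", "C")),
    3: ("C", ("A", "B")),
}


def min_changes(node, state, char):
    """Fewest transitions below `node`, given that `node` itself has `state`."""
    if isinstance(node, str):
        return 0 if MATRIX[node][char] == state else float("inf")
    return sum(
        min(min_changes(child, s, char) + (s != state) for s in (0, 1))
        for child in node
    )


def weighted_length(tree, weights):
    """Sum of per-character transition counts, each multiplied by its weight."""
    return sum(w * min_changes(tree, 0, char) for char, w in enumerate(weights))


for weights in [(1, 1, 1, 1), (1, 2, 1, 1), (1, 3, 1, 1)]:
    lengths = {n: weighted_length(t, weights) for n, t in TREES.items()}
    print(weights, lengths)
# (1, 1, 1, 1) {1: 5, 2: 6, 3: 7}
# (1, 2, 1, 1) {1: 7, 2: 7, 3: 9}
# (1, 3, 1, 1) {1: 9, 2: 8, 3: 11}
```

Doubling the weight of character 2 produces a tie between trees #1 and #2, and tripling it makes tree #2 the shortest, so the choice of weights is itself a decision that can change the outcome.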
However, like any computerised method, cladistics suffers from the problem that the quality of the output it produces is entirely dependent on the quality of the input it is given - or, to put it another way, ``garbage in, garbage out''. There are several choices that the worker must make before the computer, in all its objectivity, can be put to work, and some further issues may also cast doubt over the results of a cladistic analysis.
Poorly chosen characters can yield incorrect results. In the example above, if character 4 were replaced with ``hind legs longer than forelegs'' - not an unreasonable choice - the analysis would choose tree #2 as most parsimonious, with A and B more closely related to each other than to C.
To take an even more extreme example, if we applied the cladistic method to three taxa, lions, tigers and zebras, using only the single character ``has stripes'', we would obviously get the nonsensical result that tigers and zebras are more closely related to one another than either is to lions.
This highlights the need to choose an appropriate set of characters for a cladistic analysis. How can this be done?
Another way in which a worker's preconceived ideas can affect the result of a cladistic analysis is in the assumptions about the primitive states of the characters. Since we can't know what taxon is actually the most recent common ancestor of those being analysed, we use a well-understood outgroup as a proxy for that ancestor - that is, a taxon outside of, but as close as possible to, the clade containing the taxa to be analysed. For example, in the analysis above, if we think that the phylogeny looks like this:
    A, B and C (in some combination)
        \ | /
         \|/
          V  Haplocanthosaurus
           \/
             \  Jobaria
              \/
Then we might perform the analysis on the assumption that A, B and C's common ancestor had the same character states as Haplocanthosaurus; or, if we decided that its remains are too fragmentary to be used in this way, we might use the less closely related but better represented Jobaria. In practice, several different outgroups are typically analysed to help determine the most likely ancestral state of characters.
However, the choice of taxa to use as outgroups is clearly a subjective one: outgroups are chosen on the basis of how the phylogeny is likely to look. To pick an extreme example, consider an analysis of the relationship between various deinonychosaurs and Cretaceous birds. Most workers (believing that birds are dinosaur descendants) would use something like Oviraptor; but someone who believes that birds evolved from a non-dinosaurian ancestor such as Megalancosaurus would have to choose a much more primitive outgroup such as a basal archosauromorph. This would obviously affect the results of the analysis significantly.
Classically, cladistics works only with fossil morphology, but of course we have other knowledge which may contradict the results of cladistic analysis. For example, we should be wary of any analysis which suggests that a Triassic taxon is more derived than a Cretaceous one; or that a Laurasian taxon evolved from a Gondwanan lineage during the time that the two supercontinents were separate.
Some work is being done on modifying cladistics algorithms to take specimen age into account, but it's too early to say whether this will have much effect on how things are done.
In choosing the most likely candidate tree, cladistics programs rely on the principle of parsimony; but nature is not always parsimonious! As John Jackson points out, ``Lineages of animals have a way of evolving a feature, then removing it, and then re-evolving it again, in a way they have often had to be spoken to about.''
This is called a reversal. For example, sufficiently basal ancestors of birds were flightless, and birds evolved the feature of flight; but flightless birds such as ostriches and penguins have lost that feature. An analysis using the character ``can fly'' would be fooled into misreading this as evidence that penguins are more closely related to bird-ancestors than to other birds.
Finally, there is an implicit assumption that similarity of form - which is what cladistic analysis discovers and measures - implies commonality of descent. This is usually a good assumption, but not always. See Jonathan R. Wagner's article ``What is a cladogram anyway?'' at www.dinosauria.com/jdp/misc/cladogram.html for more discussion of this distinction.
Given these problems, why use cladistics at all? For all its limitations, it does offer the following advantages over older methods of hypothesising phylogeny:
(A person more cynical than myself might speculate that it's for this very reason that BAND (Birds Are Not Dinosaurs) proponents are among the few groups unconvinced by the merits of the cladistic method.)
In other words, the primary benefit of cladistic analysis may not be that the results are ``better'' so much as that systematists have to ``show their working'', which makes it much easier for others to improve upon it in the future.
As an example of this kind of openness, see Paul Sereno's data matrix for the whole of the Dinosauria at www.sciencemag.org/feature/data/1041760.shl - a matrix of 146 characters scored against thirteen dinosaurian subgroups. As new data are discovered, other workers can fill in some of the question marks in this matrix and re-run the analysis, hopefully yielding improved results.
In summary, while cladistic analysis is a powerful tool, it is not a ``silver bullet''. Human interpretation is still vital in the business of systematics. For this reason and because of the constant discovery of new specimens, the results of every phylogenetic analysis are provisional, subject to change in future research.
For a much more complete description and discussion of cladistics, see www.ucmp.berkeley.edu/clad/clad1.html and the linked pages.