Detecting branch length outliers on gene trees

One of these things is not like the others, one of these things just doesn’t belong.

Phylogenetic analysis in non-model organisms is increasingly relying on many hundreds of nuclear loci. As the dataset gets large, it is tempting to rely on automated methods for churning unaligned sequence data into alignments, gene trees, and species trees without even checking on the intermediate data. I’m sure we all realize that many artifacts can be introduced along the phylogenetic pipeline, and if they are not removed can result in erroneous analysis.

For example, one dataset I am working with involves sequences derived from two sources: HybSeq (targeted exon sequencing) and RNASeq (transcriptome sequencing). When aligned together, sequences from the two methods produce very different patterns of missing data. Some alignments end up looking like this:


The sequence QEBC is clearly poorly aligned to the rest of the sequence, and creates a highly suspect gene tree. As someone who “came of age” in an era where staring at ITS alignments for days was standard practice, the thought of manually inspecting hundreds of alignments, potentially introducing more errors than I attempt to fix, makes me want to give up on phylogenetics altogether. There has to be a way to flag problem sequences automatically so that only a subset of sequences need to be manually checked.

My solution was to write a script which processes the gene trees for “branch length outliers.” For my dataset, I defined an outlier as 25% of the maximum tree depth (root to tip). I later increased this threshold to 50% for non-terminal branches, and to 75% for branches whose descendants are only outgroup taxa. These values can all be changed when I run the script. 

The Python script requires the ETE3 package (make sure it’s installed with all its graphical dependencies) and is freely available here (or see the end of this post).

First, create a file that has the names of the outgroup taxa, one per line (outgroups.txt).


And a text file containing a list of gene names (gene_names.txt):


Then you would call the script on all the gene trees in a directory with GNU Parallel by:

parallel --tag python {}.tre --outgroups outgroups.txt :::: gene_names.txt > brlen_outliers.txt

A list of nodes that exceed the branch length thresholds is printed to stdout, and a PNG file is generated that highlights the outgroup taxa (blue) and the outlier branches (red). PNG files will only be generated for genes with outlier branches. Here’s the same gene as above (click to make bigger):

branch length outlier

Since I know that QEBC is supposed to be related to RJNQ, this is an easy sequence to flag and remove. GNU Parallel’s –tag string allows the name of the file to be printed to stdout along with the nodes, so the names of problem branches can be easily copy/pasted into a file:

AT1G72440_23096 --QEBC
AT1G73440_31588 --NQWS
AT1G73960_34485    /-QEBC
AT1G73960_34485 --|
AT1G73960_34485   |  /-RJNQ
AT1G73960_34485    \-|
AT1G73960_34485      \-Lopezia_racemosa_racemosa_MJM_2685
AT1G73820_18200   /-Oenothera_sinuosa_MJM555
AT1G73820_18200 --|
AT1G73820_18200   | /-Oenothera_suffrutescens_MJM596
AT1G73820_18200   \-|
AT1G73820_18200     \-Oenothera_speciosa_MJM1368

By combining the PNG files, I was able to quickly scan through the gene trees and note which taxa should obviously be removed. For this dataset, I started with 319 loci. The script flagged 177 genes, which I was able to process in about an hour. In many cases it was single branches that appeared as obvious outliers, but in some cases a couple of sequences (usually all derived from transcriptomes) were in a clade together, and all sequences could be removed.

In some cases, it was clear that a clade should not be removed for having a particularly long branch; after all, rate heterogeneity does happen. In these cases the scale bar was also a good clue– if the entire tree depth was way longer than most other trees, I would be more likely to flag a branch outlier for removal.

The full code is available below. I hope it will be useful for others in our joint quest to not ignore messy data in phylogenetic analysis!



Leave a Reply

Your email address will not be published. Required fields are marked *