Targeted bait capture (HybSeq) is a technique for sequencing many loci simultaneously using bait sequences. Once the sequence data are generated, usually on an Illumina platform, the next task is to sort the reads by target locus and assemble them. I wrote this pipeline for data we are using to reconstruct the phylogeny of pleurocarpous mosses from nuclear coding regions of more than 800 genes. We designed the bait sequences by mining the Physcomitrella patens proteome for proteins that had homologs in the pleurocarpous moss transcriptomes sequenced by the OneKP project. Our baits were manufactured by the folks at MYcroarray.
This pipeline starts with Illumina reads from a single species and assigns the reads to target genes using BLASTx (against protein sequences) or BWA (against DNA sequences). The reads are distributed to a separate directory for each gene, where they are assembled using Velvet and CAP3. The coding sequences (CDS) corresponding to the bait sequences are then extracted using Exonerate. The main output is one FASTA file per protein, containing the CDS portion of the sequence homologous to that protein. Helper scripts for analyzing the pipeline output are also included. To download the scripts, please visit my GitHub repository!
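To give a feel for the read-sorting step, here is a minimal Python sketch of how reads might be assigned to genes from BLASTx tabular output and written to one directory per gene. It is an illustration under stated assumptions, not the pipeline's actual code: the function names `best_hits` and `distribute_reads`, the `<gene>_reads.fasta` file naming, and the rule of keeping only the highest bit-score hit per read are all hypothetical choices made for this example. The only thing it relies on from BLAST itself is the standard 12-column tabular format (`-outfmt 6`).

```python
import os
from collections import defaultdict


def best_hits(blast_lines):
    """Map each read ID to the target with the highest bit score.

    Expects BLAST tabular output (-outfmt 6): 12 tab-separated columns,
    with the query ID in column 1, the subject ID in column 2, and the
    bit score in column 12. Keeping only the best hit per read is an
    assumption of this sketch.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        read_id, target, bitscore = fields[0], fields[1], float(fields[11])
        if read_id not in best or bitscore > best[read_id][1]:
            best[read_id] = (target, bitscore)
    return {read_id: target for read_id, (target, _) in best.items()}


def distribute_reads(reads, assignments, out_dir):
    """Write each assigned read to <out_dir>/<gene>/<gene>_reads.fasta.

    `reads` maps read ID to sequence; reads without an assignment are
    skipped. Returns a dict mapping each gene to the file written for it.
    The directory layout and file name are hypothetical.
    """
    by_gene = defaultdict(list)
    for read_id, seq in reads.items():
        gene = assignments.get(read_id)
        if gene is not None:
            by_gene[gene].append((read_id, seq))

    paths = {}
    for gene, records in by_gene.items():
        gene_dir = os.path.join(out_dir, gene)
        os.makedirs(gene_dir, exist_ok=True)
        path = os.path.join(gene_dir, gene + "_reads.fasta")
        with open(path, "w") as handle:
            for read_id, seq in records:
                handle.write(">%s\n%s\n" % (read_id, seq))
        paths[gene] = path
    return paths
```

Each per-gene FASTA file produced this way could then be handed to an assembler on its own, which is the point of sorting the reads before assembly: every assembly only sees reads that matched one target.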