BOSS: Boxes of Sequence Similarity

Robert Flegg1, Malcolm Simons2
1robert.flegg@med.monash.edu.au, GeneType Pty. Ltd., Fitzroy Vic 3065, Australia and Victorian Bioinformatics Consortium, PO Box 53, Monash University, Clayton Vic 3800, Australia; 2mjsimons@optusnet.com.au,

Phylogenetic analysis of an alignment assumes that the sequences derive from a common ancestor. Recombination confounds this analysis and produces an alignment that is a mosaic of blocks with different evolutionary histories. Regions bounded by recombination events do not conform to gene or intron/exon boundaries and may be extensive or relatively short. One of us (MJS) has coined the term "phylon" to name these blocks of evolutionary history. From an evolutionary point of view, designating a region as a phylon implies that the sequences in this region share a common ancestral sequence and that no recombination event has occurred within it. From an operational point of view, a phylon is a region within an aligned set, or subset, of sequences in which the sequences share a high level of sequence similarity.

It is desirable to identify, within a multiple sequence alignment, the locations of recombination events and the set of sequences to which each applies. This poster describes a program that performs this analysis for very similar sequences. The alignment used as an illustration is drawn from the MHC region of the human genome, which is rich in duplication, recombination and mutation. The level of sequence similarity is very high, and the number of alleles known at different loci ranges from 511 at the B locus to only 1 at the F locus. The alignment here consists of 28 sequences spread across 5 major alleles, extends for 320 bases and includes 67 positions at which at least one sequence varies. A phylogenetic analysis of the block as a whole fails to describe the intricate detail of this alignment.

The program is named BOSS (Boxes of Sequence Similarity) and is written in C using the EMBOSS libraries. It analyses a sequence alignment by measuring the sequence similarity between every pair of sequences within a sliding window. Regions with similarity above a user-selected threshold are stored.
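
The pairwise sliding-window step can be illustrated with a minimal Python sketch. It is not the BOSS source, which is written in C with the EMBOSS libraries, and it rests on assumptions the abstract does not state: similarity is taken to be the fraction of identical columns in a window, a window qualifies when that fraction is at least the threshold, and overlapping qualifying windows are merged into one region. The function names, the toy sequences, the window size and the threshold are all hypothetical.

```python
from itertools import combinations


def similar_regions(seq_a, seq_b, window, threshold):
    """Return merged half-open (start, end) column ranges covered by windows
    whose fraction of identical columns is at least `threshold`.

    Assumption: identity fraction is the similarity measure; the abstract
    does not say how BOSS scores a window.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(seq_a)
    if window < 1 or window > n:
        return []
    matches = [1 if a == b else 0 for a, b in zip(seq_a, seq_b)]
    regions = []
    count = sum(matches[:window])  # matches in the first window
    for start in range(n - window + 1):
        if start > 0:
            # Slide by one column: add the new column, drop the old one.
            count += matches[start + window - 1] - matches[start - 1]
        if count >= threshold * window:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


def pairwise_similar_regions(alignment, window, threshold):
    """Apply similar_regions to every pair of sequences in the alignment."""
    return {
        (a, b): similar_regions(alignment[a], alignment[b], window, threshold)
        for a, b in combinations(list(alignment), 2)
    }


# Hypothetical toy alignment, not the MHC data described in the poster.
toy_alignment = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTTT",
    "s3": "TTTTACGTAC",
}
toy_regions = pairwise_similar_regions(toy_alignment, window=4, threshold=1.0)
for pair, regions in toy_regions.items():
    print(pair, regions)
```

Keeping a running match count makes each pair cost time linear in the alignment length, so the total work grows with the number of sequence pairs rather than with the window size.
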
In the next phase this pairwise similarity information is parsed to identify the blocks, or boxes, within the alignment. Each box consists of a set of sequences that share similarity above the threshold, together with the region of the alignment over which this holds. The program itself imposes no limit on the number of sequences or the length of sequence that can be analysed. The program lists all of the boxes present in the alignment that satisfy the threshold criterion. A number of properties are derived for each box, including the sequences that define the box, its extent within the alignment and a list of those base positions that can be used to characterise the box. The length and depth of each box tell us about the relationships between the included sequences.

BOSS finds boxes that span the whole of the alignment, one for each of the major alleles. Within these boxes the sequences have a very high level of similarity; indeed, that similarity may extend for thousands of bases beyond the region considered here. BOSS also finds boxes that span shorter regions but that include sequences from several major allele classes. These boxes identify regions within the alignment where the signature of a common ancestor is still present, and the boundaries of these boxes represent the sites of recombination events.
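
One way to picture the box-finding phase is the Python sketch below. The abstract does not describe how BOSS parses the pairwise information, so everything here is an assumption made for illustration: a box is taken to be a set of sequences plus a maximal column range over which every pair in the set is similar, and a box is discarded when a deeper box (a strict superset of sequences) covers at least the same range. The subset enumeration is exponential in the number of sequences and is only meant for toy input. The names `intersect`, `find_boxes` and `toy_pair_regions` are hypothetical.

```python
from itertools import combinations


def intersect(regions_a, regions_b):
    """Intersect two sorted lists of half-open (start, end) intervals."""
    out = []
    i = j = 0
    while i < len(regions_a) and j < len(regions_b):
        start = max(regions_a[i][0], regions_b[j][0])
        end = min(regions_a[i][1], regions_b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if regions_a[i][1] < regions_b[j][1]:
            i += 1
        else:
            j += 1
    return out


def find_boxes(pair_regions, names, length):
    """Return boxes as (members, start, end), sorted by position.

    Assumed definition: every pair of members is similar over the whole
    range [start, end). Brute force over subsets; toy input only.
    """
    candidates = []
    for size in range(2, len(names) + 1):
        for members in combinations(names, size):
            shared = [(0, length)]
            for a, b in combinations(members, 2):
                regions = pair_regions.get((a, b)) or pair_regions.get((b, a)) or []
                shared = intersect(shared, regions)
            for start, end in shared:
                candidates.append((frozenset(members), start, end))
    # Drop a box when a deeper box covers at least the same column range.
    boxes = [
        box for box in candidates
        if not any(
            box[0] < other[0] and other[1] <= box[1] and box[2] <= other[2]
            for other in candidates
        )
    ]
    return sorted(boxes, key=lambda b: (b[1], b[2], sorted(b[0])))


# Hypothetical pairwise regions for three sequences over 10 columns.
toy_pair_regions = {
    ("s1", "s2"): [(0, 8)],
    ("s1", "s3"): [(3, 10)],
    ("s2", "s3"): [(3, 8)],
}
toy_boxes = find_boxes(toy_pair_regions, ["s1", "s2", "s3"], length=10)
for members, start, end in toy_boxes:
    # Depth is the number of sequences; length is the number of columns.
    print(sorted(members), start, end, "depth", len(members), "length", end - start)
```

In this toy case the result contains a long, shallow box and a shorter, deeper one that overlaps it, which mirrors the contrast the abstract draws between box length and box depth.
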