Assembly and finishing tools for repeated and polymorphic genomes

Martti T. Tammi1, Erik Arner2, Ellen Kindlund, Björn Andersson
1martti.tammi@cgb.ki.se, Center for Genomics and Bioinformatics, Karolinska Institutet; 2erik.arner@cgb.ki.se, Center for Genomics and Bioinformatics, Karolinska Institutet

As more organisms are sequenced using whole genome shotgun sequencing, increasingly complex genomes are encountered. These genomes often contain several embedded repeats of varying length and degree of difference between copies. An additional complicating factor is that some copies may be in part identical. Such repeats are particularly difficult to assemble when the copy length exceeds the average read length. Also, some organisms have a high degree of polymorphism between homologous chromosomes, which complicates the task of assembly even in the presence of mate-pair information. As an example, in the parasite T. cruzi, homologous chromosomes differ by 5% on average. A completely automated shotgun fragment assembler that is able to assemble such genomes, resulting in finished sequence, does not yet exist. We present DNPTrapper, a shotgun assembly visualization and editing tool specifically designed for finishing complex repeated regions. DNPTrapper extends previously developed methods for rapid and sensitive multiple alignment construction and for detection of defined nucleotide positions (DNPs), which represent single base differences between repeat copies. In addition, it incorporates information such as mate pairs, map data and other sequence features. The results are displayed graphically in an editor that allows manual curation of the input data. Global and detailed views of several aspects of the input data can be displayed, e.g. the length and location of repeat copies, differences between repeats, DNP clusters, mate pairs, repeat borders and other sequence features, as well as standard sequence data. Using DNPTrapper, we have resolved a repeated region in T. cruzi consisting of six 1800 bp tandem repeats, where the repeats differed by 0.25% on average. DNPTrapper is a standalone assembly and finishing tool.
In addition, we are currently integrating the program into a whole genome shotgun assembler under development. The aim is to let DNPTrapper resolve complex regions in a semi-automated fashion. The results are then fed back into the assembler, and the assembly is adjusted accordingly. The tools will contribute to the assembly and finishing of complex genomes, and reduce the amount of manual work required.
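
To illustrate what a DNP is, the following minimal sketch flags candidate columns in a small, already aligned set of reads. It is not the DNPTrapper method: the function name `candidate_dnp_columns`, the `min_support` threshold and the toy reads are assumptions made for illustration only, and a simple support count stands in for the actual detection procedure. A column is reported when at least two different bases are each seen in at least `min_support` reads, so that an isolated mismatch (more likely a sequencing error) is ignored.

```python
from collections import Counter


def candidate_dnp_columns(aligned_reads, min_support=2):
    """Return 0-based alignment columns where at least two different bases
    are each present in at least `min_support` reads.

    Illustrative only: a simple count threshold, not DNPTrapper's method.
    """
    if not aligned_reads:
        return []
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    columns = []
    for i in range(width):
        # Count the bases observed in this column across all reads.
        counts = Counter(read[i].upper() for read in aligned_reads)
        # Keep only real bases (ignore gaps and N) with enough support.
        supported = [base for base, n in counts.items()
                     if base in "ACGT" and n >= min_support]
        if len(supported) >= 2:
            columns.append(i)
    return columns


# Hypothetical toy alignment: reads from two repeat copies that differ
# at column 3, plus one isolated mismatch at column 8.
reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAC",
    "ACGAACGTAC",
    "ACGTACGTTC",
]
print(candidate_dnp_columns(reads))  # prints [3]
```

In this toy input, column 3 is reported because both T and A are supported by at least two reads, while the single T at column 8 is not. With repeats that differ by only 0.25%, as in the T. cruzi region above, two 1800 bp copies would be expected to differ at about four or five such positions.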