Recall that one of the first steps in genome sequence assembly is determining which reads overlap. In this talk we present recent results from a collaboration between the University of Maryland and the Baylor College of Medicine that measure the effect on assembly of various techniques for computing overlaps while the remainder of the assembly process remains unchanged. The efficacy of some of the Maryland techniques was demonstrated last year in collaboration with Celera Genomics Inc., in their assembly of Drosophila melanogaster; here we study their effect on the assembly of the genome of Rattus norvegicus. As a basis for comparison, we test our assemblies against the small amount of independently finished sequence that exists for R. norvegicus.
The Atlas assembly at Baylor has already produced a high-quality draft sequence for R. norvegicus. Nonetheless, some five percent of the mapped scaffolds still lies in gaps. We find that when the set of overlaps is more carefully selected before being fed to Atlas, the quality of the scaffolds improves over the already high-quality assembly. Specifically, the total amount of sequence produced increases by several percent, bringing us much closer to 100%; the correctness of individual bases and the contig length also improve.
Trimmed reads have far fewer bases than untrimmed reads. Making use of some of the low-quality regions is of considerable value, since the U.S. government alone spends roughly $100 million annually generating these sequences. We use error correction based on multi-read comparison to generate a consensus sequence across long stretches of low-quality bases. We find that several moderately low-quality overlapping sequences can give as much information as a single high-quality sequence, allowing us to extend individual reads by up to 80%. This yields several hundred extra bases per read and improves sequence assembly beyond the gains described above.
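To illustrate why several low-quality reads can stand in for one high-quality read, the following is a minimal sketch of a quality-weighted consensus call at a single aligned position. It is not the error-correction procedure used in this work. It assumes the reads are already aligned, that errors are independent between reads, that a miscalled base is equally likely to be any of the other three bases, and that quality values follow the Phred convention (error probability 10^(-Q/10)). The function names are hypothetical.

```python
import math

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality value to an error probability."""
    # Clamp at 0.75: at that level a base call carries no information.
    return min(10 ** (-q / 10), 0.75)


def column_consensus(observations):
    """Call a consensus base from (base, phred_quality) pairs at one position.

    Returns (consensus_base, consensus_phred_quality), assuming a uniform
    prior over the four bases and independent errors between reads.
    """
    log_like = {b: 0.0 for b in BASES}
    for base, q in observations:
        e = phred_to_error(q)
        for candidate in BASES:
            # Probability of observing `base` if the true base is `candidate`.
            p = (1 - e) if candidate == base else e / 3
            log_like[candidate] += math.log(p)

    # Normalise in a numerically stable way to obtain posterior weights.
    top = max(log_like.values())
    weights = {b: math.exp(v - top) for b, v in log_like.items()}
    total = sum(weights.values())

    best = max(BASES, key=lambda b: weights[b])
    p_err = sum(w for b, w in weights.items() if b != best) / total
    quality = float("inf") if p_err == 0 else -10 * math.log10(p_err)
    return best, quality


if __name__ == "__main__":
    # One read at Q10 versus three agreeing reads at Q10.
    print(column_consensus([("A", 10)]))
    print(column_consensus([("A", 10), ("A", 10), ("A", 10)]))
```

Under these assumptions, three agreeing calls at Q10 (a 10% error rate each) combine into a consensus call with an error probability well below 0.1%, which is the sense in which overlapping low-quality bases can substitute for a single high-quality base.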