IMPROVING SEQUENCE ASSEMBLIES USING HIGH-QUALITY OVERLAPS

Michael Roberts1, James Yorke2, Brian Hunt, Wayne Hayes, Aleksey Zimin, Cevat Ustun, Paul Havlak
1tri@ipst.umd.edu, University of Maryland; 2yorke@ipst.umd.edu, University of Maryland

Finishing a genome costs about as much as the initial assembly, with most of that cost going toward filling gaps (Celniker et al., Genome Biology 2002-03-12). Since initial assemblies typically recover 95-99% of the sequence, any improvement in the quality and amount of sequence that brings us closer to 100%, no matter how small, translates into an enormous cost saving in the finishing step.

Recall that one of the first steps in genome sequence assembly is determining which reads overlap. In this talk we present recent results from a collaboration between the University of Maryland and the Baylor College of Medicine that measures the effect on assembly of various techniques for computing overlaps while the remainder of the assembly process remains unchanged. The efficacy of some of the Maryland techniques was demonstrated last year in collaboration with Celera Genomics Inc., in their assembly of Drosophila melanogaster; here we study their effect on the assembly of the genome of Rattus norvegicus. As a basis for comparison, we test our assemblies against a small amount of independently finished sequence that exists for R. norvegicus.
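For readers less familiar with the overlap step, the following is a minimal sketch of what "determining which reads overlap" means in its simplest form. It is an illustration only and not the Maryland or Atlas overlap method: it finds exact suffix-prefix matches by brute force, whereas production overlappers must tolerate sequencing errors and scale to millions of reads. The function names and the min_len parameter are hypothetical.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    Assumption: exact matching only; real reads need error-tolerant alignment.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the best one.
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def find_overlaps(reads, min_len: int = 3):
    """Return (i, j, k) triples: read i's suffix overlaps read j's prefix by k bases."""
    overlaps = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue  # a read is not compared with itself
            k = suffix_prefix_overlap(a, b, min_len)
            if k:
                overlaps.append((i, j, k))
    return overlaps


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "TACGGA", "GGATTT"]  # made-up reads for illustration
    print(find_overlaps(toy_reads))
```

The all-pairs loop is quadratic in the number of reads, which is acceptable for a toy example but is one reason real pipelines index reads before comparing them.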

The Atlas assembly at Baylor has already produced a high-quality draft sequence for R. norvegicus. Nonetheless, this still leaves some five percent of the mapped scaffolds in gaps. We find that when the set of overlaps is more carefully selected before being fed to Atlas, the quality of the scaffolds improves over the already high-quality assembly. Specifically, the total amount of sequence produced increases by several percent, bringing us much closer to 100%; in addition, both the correctness of individual bases and contig length improve.

Read Extension.
Trimmed reads have far fewer bases than untrimmed reads. Making use of some of the low-quality region is of considerable value, since the U.S. government alone spends roughly $100 million annually generating these sequences. We use multi-read-comparison-based error correction to generate a consensus sequence across long stretches of low-quality bases. We find that several moderately low-quality overlapping sequences can give us as much information as a single high-quality sequence, allowing us to extend the length of individual reads by up to 80%. This yields several hundred extra bases per read and improves sequence assembly beyond the gains described above.
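The idea that several low-quality reads can stand in for one high-quality read can be sketched with a simple quality-weighted vote. This is an illustrative simplification and not the error-correction method used in this work: it assumes the reads are already aligned into columns, that each base carries a Phred-style quality score, and that errors in different reads are independent, so that the scores of agreeing bases can be summed and the scores of disagreeing bases subtracted. The function names are hypothetical.

```python
from collections import defaultdict


def consensus_column(column):
    """Call one consensus base from a column of (base, phred_quality) pairs.

    Assumption: errors are independent across reads, so the qualities of
    agreeing bases add up and dissenting qualities are subtracted.
    """
    support = defaultdict(int)
    for base, qual in column:
        support[base] += qual
    # The base with the most accumulated quality wins the vote.
    best = max(support, key=support.get)
    dissent = sum(q for b, q in support.items() if b != best)
    return best, max(support[best] - dissent, 0)


def consensus(columns):
    """Build a consensus sequence and per-base qualities from aligned columns."""
    calls = [consensus_column(col) for col in columns if col]  # skip empty columns
    sequence = "".join(base for base, _ in calls)
    qualities = [qual for _, qual in calls]
    return sequence, qualities


if __name__ == "__main__":
    # Three reads that each call 'A' at quality 10 (about a 10% error rate each).
    print(consensus_column([("A", 10), ("A", 10), ("A", 10)]))
    # Two made-up columns, the second with a disagreement.
    print(consensus([[("A", 10), ("A", 10)], [("C", 8), ("G", 15), ("C", 9)]]))
```

Under these assumptions, three agreeing quality-10 bases combine into a consensus call of quality 30, which is the sense in which a few moderately low-quality reads can match a single high-quality one. Real data violate the independence assumption to some degree, so the summed score should be read as an optimistic estimate.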