CleanBank: a database of sequence artifacts

Hanne Volpin¹, Eitan Rubin²
¹hanne@agri.gov.il, Bioinformatics, Agricultural Research Organization, Bet Dagan, Israel; ²Eitan.Rubin@weizmann.ac.il, Bioinformatics and Biological Computing, Weizmann Institute of Science, Rehovot, Israel

CleanBank is a database that documents suspected artifacts in the sequences (e.g. vector contamination) and/or the annotation (e.g. erroneous species assignment) of records in the international sequence databases (INSD). The INSD are obliged to ensure the completeness of the sequence record and to credit all the original authors of a sequence (Brunak et al., Science 2002, 298:1333). As a result, however, researchers who identify errors in sequences have no way of publishing their findings in the original database. These artifacts cause two major problems: inexperienced users of bioinformatics often misinterpret results, and experienced users still find that high-throughput research (e.g. EST assembly into transcripts) requires intensive cleaning of the sequences. To overcome these problems while maintaining the integrity of the original data, we have established a parallel database, CleanBank.

In CleanBank, artifacts are either reported by researchers or identified by curated algorithms. Current algorithms detect E. coli contamination (using BLAT) and vector contamination (using BLAST and a novel method based on restriction site identification). Confidence levels are assigned both to the reliability of each curated method and to individual results. Single entries can be explored, or a cleaned version of the INSD can be produced according to a confidence level chosen by the user.
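
As a rough illustration of how a cleaned version might be derived from a user-chosen confidence level, the following Python sketch masks every artifact region whose confidence meets a threshold. Everything in it is an assumption made for illustration: the `Artifact` record, its fields, the single numeric confidence score, and the choice to mask with `N` rather than trim are hypothetical and do not describe the actual CleanBank schema or interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Hypothetical artifact record; not the actual CleanBank schema."""
    accession: str     # identifier of the affected sequence entry
    start: int         # 0-based, inclusive
    end: int           # 0-based, exclusive
    kind: str          # e.g. "vector" or "E. coli"
    confidence: float  # assumed here to be a single score in [0, 1]


def clean_sequences(sequences, artifacts, min_confidence):
    """Return copies of the sequences with qualifying artifact regions masked.

    Only artifacts with confidence >= min_confidence are applied; the
    original sequences are left untouched.
    """
    cleaned = {}
    for accession, seq in sequences.items():
        chars = list(seq)
        for art in artifacts:
            if art.accession != accession or art.confidence < min_confidence:
                continue
            # Clamp the region to the sequence bounds before masking.
            start = max(0, art.start)
            end = min(len(chars), art.end)
            for i in range(start, end):
                chars[i] = "N"
        cleaned[accession] = "".join(chars)
    return cleaned


# Toy data, invented for this example.
sequences = {"X1": "ACGTACGTAC", "X2": "GGGG"}
artifacts = [
    Artifact("X1", 0, 4, "vector", 0.9),
    Artifact("X1", 6, 8, "E. coli", 0.4),
]

strict = clean_sequences(sequences, artifacts, min_confidence=0.5)
lenient = clean_sequences(sequences, artifacts, min_confidence=0.3)
```

With these toy inputs, a stricter threshold masks only the high-confidence vector region, while a more lenient one also masks the lower-confidence hit. Because masking is applied to copies, the original records remain intact, in keeping with the parallel-database approach.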

For a more detailed description of the proposed database, and a preview of the data, see http://bip.weizmann.ac.il/MIW/CleanBank/index.html