Sunday Poster Presentations: Whole Genome Analysis

14 - Inferring Regulatory Structure from Expression Profiling of Mutants
Nir Friedman, Dana Pe'er, Hebrew University of Jerusalem
An important experimental design toward understanding the regulatory program involves global gene expression measurements of mutants. We infer causal structure from such measurements in a Bayesian framework. We present methods that derive statistical confidence in features of the regulatory network and apply these methods to S. cerevisiae mutant data from the Young lab.

15 - Cross-genome Analysis of Herpes Viruses
Mar Alba, Rhiju Das, Christine Orengo, Paul Kellam, University College London
Protein families based on sequence homology were derived from complete herpesvirus genomes. The families were used to construct phylogenetic trees, based on all herpesvirus conserved protein sequences or on gene function conservation. Functional groups were correlated with expression clusters derived from array data of human herpesvirus 8 genes. All the data is being stored in VIDA (Virus Database).

16 - Integrating Gene Expression and Genome Sequence Analysis
Stefan Bekiranov, Eduardo Fajardo, Charlie Forster, Mathias Katzer, Carsten Meyer, Kass Schmitt, Alexander Sczyrba, Terry Gaasterland, The Rockefeller University
Explanation of gene expression cluster patterns requires integrating genomic features of genes, gene function annotations, and gene expression measurements. It also requires careful characterization of the amount of noise and error on each hybridized microarray through correlation of repeated experiments. The TANGO system has been designed to accomplish these goals.

17 - Using Homologous Gene Pairs Between Mouse and Human to Identify Gene Candidates that Display Sequence Variation
Tzu-Ming Chern, Winston Hide, University of Western Cape, South Africa
We attempt to understand the diversity of expression forms between human and mouse genes. Using a cross-species homology approach, it is possible to rapidly identify potential splice variants. Candidate splice clusters are visualized using CRAW reports under the STACKPACK package. Integration of alternate forms between species is presented.

18 - Extending Traditional Query-based Integration Approaches for Gene Characterization in Genomic Data
Barbara A. Eckman, Leo A. Laroco, SmithKline Beecham Pharmaceuticals
Gene characterization in genomic sequence requires integration of multiple heterogeneous datasources and analysis techniques. We provide SQL-like query access over relational and flatfile databases, Internet Websites, and results of on-the-fly analyses. Special-purpose query conditions like regular expression pattern matching on HSP alignments enable quick identification of potentially interesting results.

19 - The Influence of Oligonucleotide Frequency on Global Genome Structure and Biological Complexity
Ahmed Fadiel, Dong Qi, A. Jamie Cuticchia, The Hospital for Sick Children, Canada
Complete genomes of 25 organisms belonging to three cellular life domains (Archaea, Bacteria, and Eukaryota) were explored. Genomic analysis was performed using proprietary software coupled with other programs. The influence of oligonucleotide frequencies on the global genome structure is discussed and the biological significance of some patterns is also explored.

20 - Identifying Small ORFs in Microbial Genomes
Terry Gaasterland, Alexander Sczyrba, The Rockefeller University
Proteins shorter than 100 aa remain difficult to identify computationally in genomic sequences. Statistical analysis of codon usage and position bias in small open reading frames is unreliable. We rank small ORFs by conservation across genomes and nominate them for further experimentation.

21 - Identifying Candidate Coding Region Single Nucleotide Polymorphisms (cSNPs) and Alternative Splice Variants in Human Genome Using Assembled Expressed Sequence Tags (ESTs)
Kavita Garg, Deborah Nickerson, Phil Green, University of Washington
Using assembled ESTs from 50 different cDNA libraries, we have identified contigs that represent the complete coding sequences of 850 known human genes, and have scanned them for high-quality substitutions. We report the analysis and characteristics of candidate cSNPs found in coding regions of 165 of these genes.

22 - Eugene: A Non-redundant View of the Human Transcriptome
Jarret Glasscock, Warren Gish, Washington University
Eugene makes it easier to get transcript data for a given region, provides a non-redundant representation of the transcript data, provides high quality (genomic) sequence, and alleviates problems associated with chimeras and paralogues. Eugene accomplishes this by stringent clustering of the transcript data with genomic data and subsequent extraction of the genomic segment.

23 - Analytical Methods for Genome to Genome Comparisons
Martin Gollery, David Rector, Al Shpuntof, Jim Lindelien, Time Logic Corporation
TeraBLAST is a hardware-accelerated implementation of BLAST that solves large homology search problems in a drastically shorter timeframe, yet provides a higher sensitivity than software BLAST. In this presentation we will present results of comparisons of more than two dozen completed genomes. Output files will be available on CD-ROM for interested researchers.

24 - Proteome-wide Prediction of Glycosylation Sites
Ramneek Gupta, Technical University of Denmark; Eva Jung, Swiss Institute of Bioinformatics; Jan Hansen, Søren Brunak, Technical University of Denmark
Glycosylation is an important post-translational modification that influences protein function. Two major types of protein glycosylation are N-linked (affecting Asn residues) and O-linked (Ser/Thr). No discriminative acceptor consensus sequence exists for either type. We train artificial neural networks to identify glycosylated sites, and observe their spread across different classes of proteins in a cell.

25 - Interrogating and Visualizing Annotated Whole Genome Databases
David P. Hansen, Guenther Kurapkat, Thure Etzold, LION Bioscience, Germany
Using LION's bioSCOUT, we have annotated a number of whole microbial genomes. The unique data integration capabilities of SRS are then used to compare the annotations of different genomes. SRS enables very rapid queries to, for example, find all of the members of a specific protein family in H. pylori that have an ortholog in E. coli and Haemophilus and show their respective localisation on the genome.

26 - A Proteomic System for Tracking and Identifying Yeast Mitochondrial Proteins
Mark Holmes, Melissa Kimball, Raymond Gesteland, Michael Giddings, University of Utah
Two database services for protein tracking and identification are described. Investigators enter data using assisted forms that automatically link to prior work. Subsequent protein identification can be done using common fields matched against public databases. Putative post-translational modifications are also predicted, using a heuristically-bounded, depth-first search; fuzzy logic evaluates for relative accuracy, depth and known frequency of candidate modifications.

27 - Data Analysis for Large-scale, Genome-wide Transcription Profiling
Wolfgang Huber, Anja von Heydebreck, German Cancer Research Center; Judith Boer, Leiden University Medical Center, The Netherlands; Friederike Wilmer, Holger Sültmann, Martin Vingron, Annemarie Poustka, German Cancer Research Center
We investigate 140 complex hybridisations from 38 paired kidney tumor and normal samples on nylon membranes spotted with a non-redundant human library of 32,000 clones. A non-parametric statistic is used to identify differentially expressed genes, giving quantifiable errors of first and second kind. The correlation structure of a hierarchical model of variation sources is calculated.

28 - Analysis of Helicobacter pylori Genome for Codon Usage and Base Composition in H. pylori-specific Genes of Unknown Function
Raphael Isokpehi, A. B. Sofoluwe, A. O. Coker, University of Lagos, Nigeria
Codon usage and base composition were determined for 206 Helicobacter pylori-specific genes of unknown function. Bias in codon usage and statistically significant deviation in base composition were observed for some of the genes compared to the average H. pylori gene. Understanding the function of these genes may be useful in drug discovery.
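As a rough illustration of the two quantities measured above, the following Python sketch tallies relative codon frequencies and base composition for a single in-frame coding sequence. The function names are hypothetical, and the statistical comparison against the average gene is not attempted here.

```python
from collections import Counter


def codon_usage(cds: str) -> dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence."""
    seq = cds.upper()
    # Step through the sequence in frame, dropping any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def base_composition(seq: str) -> dict[str, float]:
    """Fraction of A, C, G and T in a sequence."""
    seq = seq.upper()
    return {base: seq.count(base) / len(seq) for base in "ACGT"} if seq else {}
```

Comparing the per-gene dictionaries with the same quantities pooled over all genes would be one way to look for deviating genes.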

29 - Sister Chromatid Cohesion: Phylogenies, Motifs, and Interaction Networks
Susan Jones, John Sgouros, Computational Genome Analysis Laboratory, England
Cohesin links sister chromatids on the mitotic spindle. In budding yeast cohesin comprises Smc1p, Smc3p, Scc1p and Scc3p. We have identified new homologues of SMC proteins and created a phylogenetic tree. Using proteomic databases we have created a cohesion interaction network and identified sequence motifs common to pairs of proteins within the network.

30 - Identification of Putative Non-coding RNAs in Hyperthermophilic Archaea Genomes
Robert J. Klein, Sean R. Eddy, Washington University
We are using a bias in G+C content as the basis for a computational screen for novel structural RNAs in sequenced, AT-rich, hyperthermophile genomes. We have shown that the screen identifies most known and several putative novel noncoding RNA loci. We are experimentally testing whether these loci are expressed.
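A screen of this kind can be pictured as a sliding-window scan for G+C-rich segments in an AT-rich background. The sketch below is only a minimal illustration: the window size, step, and threshold are arbitrary assumptions, not the parameters of the actual screen.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_rich_windows(genome: str, window: int = 50, step: int = 10,
                    threshold: float = 0.6) -> list[int]:
    """Start positions of windows whose G+C fraction meets the threshold."""
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        if gc_fraction(genome[start:start + window]) >= threshold:
            hits.append(start)
    return hits
```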

31 - Identification of Regulatory Sites in Bacterial Genome
Hao Li, University of California, San Francisco; Eric D. Siggia, The Rockefeller University
We present a new algorithm that identifies the binding sites of different regulatory proteins from genome sequences by detecting all significant patterns of the form w_1N_xw_2. The patterns are grouped into clusters, and weight matrices are derived by an alignment of the matching sequences. For E. coli, we obtained ~100 matrices matching 1/3 of the characterized factors with high statistical significance.
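To make the pattern form w_1N_xw_2 concrete, the following Python sketch counts every pair of short words separated by a gap of x arbitrary bases. It covers the enumeration step only; the significance test, clustering, and weight matrix construction are not shown, and the word length and maximum gap are illustrative assumptions.

```python
from collections import Counter


def count_dimers(seqs, word_len: int = 3, max_gap: int = 2) -> Counter:
    """Count (w1, gap, w2) occurrences: two words separated by `gap` bases."""
    counts = Counter()
    for seq in seqs:
        s = seq.upper()
        for gap in range(max_gap + 1):
            span = 2 * word_len + gap  # total length covered by w1 N_gap w2
            for i in range(len(s) - span + 1):
                w1 = s[i:i + word_len]
                w2 = s[i + word_len + gap:i + span]
                counts[(w1, gap, w2)] += 1
    return counts
```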

32 - Non-structured Regions in Genomic Proteins: Junk or Functional?
Jinfeng Liu, Burkhard Rost, Columbia University
We found that regions of more than 50 consecutive residues with no regular secondary structure were common in all three kingdoms, in particular in eukaryotes. Although not structured, these floppy regions were found to be evolutionarily conserved and, on average, relevant for function (SwissProt annotation, yeast two-hybrid data).

33 - Using Non-homology Methods for Genome-wide Prediction of Protein Function
Edward Marcotte, University of California, Los Angeles, Protein Pathways, Inc.; Matteo Pellegrini, Michael Thompson, Todd Yeates, David Eisenberg, Protein Pathways, Inc.
Genome and expression analyses reveal functional links between genes, useful both for predicting protein function and for creating protein networks analogous to those derived experimentally. We apply these methods (phylogenetic profiles, expression clustering, and Rosetta Stone analysis) to find yeast protein networks and predict function for >50% of the uncharacterized yeast proteins.
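Of the three methods named above, phylogenetic profiles are the simplest to sketch: each protein gets a presence/absence vector across genomes, and proteins with matching vectors are proposed as functionally linked. The Python below is a minimal illustration with hypothetical names and a plain mismatch count as the similarity measure; it is not the authors' implementation.

```python
from itertools import combinations


def mismatches(a, b) -> int:
    """Number of genomes in which two presence/absence profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def linked_pairs(profiles: dict, max_mismatches: int = 0) -> list:
    """Protein pairs whose profiles differ in at most `max_mismatches` genomes."""
    pairs = []
    for (p, prof_p), (q, prof_q) in combinations(sorted(profiles.items()), 2):
        if mismatches(prof_p, prof_q) <= max_mismatches:
            pairs.append((p, q))
    return pairs
```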

34 - High Message Variety and Low Intrinsic Error Correction Capabilities of Whole Genome Symbol Strings: An Information Theoretic Perspective
Preeti Mehta, Ramneek Gupta, S. Krishnaswamy, Madurai Kamaraj University
Information content calculations based on Shannon entropy for 22 prokaryotic genomes and 29 eukaryotic chromosomes are reported. Chromosomes in eukaryotic organisms tend to maintain similar information densities. For all genomes, information density values (Id) are low, indicating a state of homeostasis with respect to maintaining high potential message variety and low intrinsic error combating capabilities.
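The underlying quantity, Shannon entropy, is H = -sum(p_i * log2(p_i)) over the observed symbol frequencies. The Python sketch below computes it for overlapping k-mers of a sequence; how the abstract's information density (Id) is derived from the entropy is not stated, so that step is deliberately left out.

```python
import math
from collections import Counter


def shannon_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy, in bits, of the overlapping k-mer distribution."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

For single nucleotides (k = 1) the value ranges from 0 bits for a homopolymer to 2 bits when all four bases are equally frequent.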

35 - Whole Genome Phylogenies via Singular Value Decomposition (SVD)
Karen Moffett, Mathew Kay, Steve Baker, Gary Stuart, Indiana State University
A novel SVD-based analysis utilizing all possible overlapping tripeptides was used to produce phylogenetic trees from whole genome data. Both gene and species trees resulted from the same analysis. In one application, more than 1,280 mitochondrial proteins (and the nearly 100 metazoan organisms they represent) were accurately placed within phylogenetic trees.

36 - Preservation and Prediction of Transcription Units Across Microbial Genomes: Escherichia coli and Haemophilus influenzae
Gabriel Moreno-Hagelsieb, Centro de Investigacion sobre Fijacion de Nitrogeno, Mexico; Temple F. Smith, Boston University; Julia Collado-Vides, Centro de Investigacion sobre Fijacion de Nitrogeno, Mexico
Genes among gram-negatives are found together more frequently when they correspond to Escherichia coli operons than to transcription unit boundaries. The point of highest prediction accuracy is coincident with the point of highest preservation profile of predicted operons versus transcription unit boundaries. We extend such predictions to Haemophilus influenzae.

37 - Distribution of the Local Thermodynamic Stability in the Complete Genome Sequence
Shu-Yun Le, Wei-min Liu, Jih-H. Chen, Jacob V. Maizel, Jr., National Institutes of Health
We reveal the whole distribution of local thermodynamic stability in single-stranded RNA sequences corresponding to the complete H. pylori genome. Our results indicate that the highly stable and more statistically significant folding regions are predominantly in noncoding sequences, while the highly unstable regions are predominantly in the protein coding sequences.

38 - Anubis 4: A Java Genome Map Viewer
J. Paul Nelson, Alan L. Archibald, Andy S. Law, Roslin Institute, UK
We present a Java-enabled map viewer, Anubis 4, that graphically displays genome maps from one or more species. Each label on the map provides a clickable link to details of mapped objects. Functionality is placed on the client side: maps can be rotated, flipped, zoomed, or moved almost instantaneously.

39 - A New Hashing Algorithm for Genomic Sequence Search
Zemin Ning, James C. Mullikin, Wellcome Trust Genome Campus, UK
A new hashing algorithm has been developed for genomic sequence search. Sequences are preprocessed into a hash table, and the search is conducted only on the hash table rather than the whole database. The search requires two to three orders of magnitude less CPU time than two widely used search tools.
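The general idea of searching a precomputed hash table instead of the raw database can be sketched as below. This is a generic k-mer index in Python under assumed parameters (k = 4, hits grouped by alignment diagonal); it is not the algorithm presented in the poster, whose details are not given here.

```python
from collections import Counter, defaultdict


def build_kmer_index(database: dict, k: int = 4) -> dict:
    """Map each k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def search(index: dict, query: str, k: int = 4) -> Counter:
    """Count shared k-mers per (sequence name, diagonal offset)."""
    hits = Counter()
    for qpos in range(len(query) - k + 1):
        # Only the hash table is consulted; the database is never rescanned.
        for name, pos in index.get(query[qpos:qpos + k], ()):
            hits[(name, pos - qpos)] += 1
    return hits
```

Many hits on one diagonal suggest a candidate match region that a full alignment step could then examine.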

40 - A DNA Structural Atlas for Escherichia coli
Anders Gorm Pedersen, Lars Juhl Jensen, Søren Brunak, Hans-Henrik Staerfeldt, David W. Ussery, The Technical University of Denmark
We have performed a computational analysis of DNA structural features (curvature, flexibility, and stability) in several prokaryotic genomes. For visualizing results we developed the "Genome Atlas" where structural measures in entire chromosomes are plotted in the form of color-coded wheels. Possible biological interpretations of our results are discussed. http://www.cbs.dtu.dk/services/GenomeAtlas/

41 - A Whole-genome Analysis of Combinatorial Regulation of Gene Networks
Yitzhak Pilpel, Priya Sudarsanam, George Church, Harvard Medical School
We have launched a whole-genome survey to systematically identify combinatorial sets of transcription factors that regulate the expression of multi-gene networks. A probability model was utilized to select statistically significant combinations of regulatory motifs. The biological significance of such motif combinations was assessed by their impact on mRNA expression.

42 - New Algorithms for Large-scale EST Clustering
Andrey Ptitsyn, Winston Hide, South African National Bioinformatics Institute
We developed a new EST clustering algorithm, based on a new, fast, linear statistical measure of sequence similarity that ignores low-complexity sequence. It is implemented in two variants: loose clustering with assembly by a third-party system, and stringent clustering with simultaneous consensus generation and alternative variant apprehension. Tests show a significant improvement over D2_cluster in computation time and cluster sizes.

43 - Novel Pattern Prediction in Complete Genomes
Dong Qi, Ahmed Fadiel, A. Jamie Cuticchia, The Hospital for Sick Children, Canada
Higher-order oligonucleotides were used to address whether or not Markov chain models can provide a unified framework in which to describe whole genomes in terms of constituent oligonucleotides. Algorithms capable of oligonucleotide predictions were developed. Our results showed that the (n-2)-order Markov chain is consistently a better predictor. Intra-genomic variability is also discussed.
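For an oligonucleotide of length n, the standard (n-2)-order Markov estimate of its expected count is N(w_1..w_{n-1}) * N(w_2..w_n) / N(w_2..w_{n-1}), where N denotes observed counts of the shorter words. The Python sketch below implements that textbook estimator as an illustration; whether the poster's algorithms take exactly this form is an assumption.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Counts of all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(seq: str, word: str) -> float:
    """Expected count of `word` under an (n-2)-order Markov model, n = len(word)."""
    n = len(word)
    if n < 3:
        raise ValueError("word must have at least 3 symbols")
    shorter = kmer_counts(seq, n - 1)
    prefix = shorter[word[:-1]]  # N(w_1..w_{n-1})
    suffix = shorter[word[1:]]   # N(w_2..w_n)
    core = kmer_counts(seq, n - 2)[word[1:-1]]  # N(w_2..w_{n-1})
    return prefix * suffix / core if core else 0.0
```

Comparing this expectation with the observed count of the word indicates how over- or under-represented it is.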

44 - Large-scale Genomic Sequence Composition Analysis
Eric C. Rouchka, David J. States, Washington University
We attempt to collate a definitive set of nonredundant extended human genomic sequences by extending individual human entries in GenBank for the purpose of analyzing chromosomal genome organization at the sequence level. We present our results concerning isochore organization as well as studies into the mechanisms that promote isochore maintenance.

45 - Computation and Visualization of Degenerate Repeats in Complete Genomes
Chris Schleiermacher, University of Bielefeld
A systematic study of repetitive DNA on a genomic scale requires extensive algorithmic support. We have developed the Reputer family of programs that serve as a fundamental tool in such studies. Efficient and complete detection of various types of repeats is provided together with an evaluation of significance, interactive visualization, and simple interfacing.

46 - In silico Cloning of Chromosome 7q35-q36 for an Accurate Physical Map, Refinement of the Region, and Identification of Candidate Genes for Mutation Analysis
T. M. Severson, K. Rust, H. Xu, L. Huynh, J. Lie, C. Bodell, T. S. Acott, M. K. Wirtz, Oregon Health Sciences University
Primary open-angle glaucoma (POAG) in a large designated glaucoma family mapped to chromosome 7q35-q36. The region defined as GLC1F was refined and a physical map was constructed using computational tools, unfinished (phase 0-2) search databases, and finished (phase 3) search databases at the National Center for Biotechnology Information (NCBI).

47 - A Novel Method for Estimating Orthology Confidence
Christian Storm, Erik L. L. Sonnhammer, Karolinska Institutet, Sweden
A novel method is presented that analyzes bootstrap trees for orthologs. Instead of searching the optimal tree for branches that support orthology with high bootstrap support, we analyze each bootstrap tree and look for orthology assignments that occur frequently. The frequency in turn provides a confidence estimate.
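The frequency-based confidence described above can be sketched as follows, assuming each bootstrap tree has already been reduced to a list of ortholog pairs. That reduction (tree parsing and orthology inference) is the hard part and is not shown; the input format and function name are hypothetical.

```python
from collections import Counter


def orthology_confidence(replicates: list) -> dict:
    """Fraction of bootstrap replicates in which each ortholog pair occurs."""
    counts = Counter()
    for pairs in replicates:
        # Use unordered pairs and count each pair at most once per replicate.
        counts.update({frozenset(pair) for pair in pairs})
    n = len(replicates)
    return {tuple(sorted(pair)): c / n for pair, c in counts.items()}
```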

48 - Extraction of Microscopic and Macroscopic Structural Features from the DNA Sequence of Human Chromosome 22
Hironobu Takahashi, Yasuhide Mori, Ryuichi Oka, Real World Computing Partnership, Japan
The proposed method extends the so-called Galaxy Clustering Method, which was developed for natural language processing, and applies it to sequence categorization and multiple alignment problems without any segmentation step. Experimental results on the whole sequence of human chromosome 22 are shown.

49 - Analysis of Protein Coding Genes in Complete Microbial Genomes
Tatiana Tatusova, Sergei Resenchuk, Ilene Karsch-Mizrachi, James Ostell, National Institutes of Health
A new approach combines protein similarity search with phylogenetic classification. The analysis at the single gene level includes the comparison of amino-acid sequences of proteins encoded by complete genomes to proteins in current databases. Neighbor relationships to the proteins with known 3-D structures are detected and linked to Cn3D viewer.

50 - Functionating the Proteome of Aquifex aeolicus
M. J. Thompson, M. Pellegrini, Protein Pathways, Inc.; E. M. Marcotte, T. O. Yeates, D. Eisenberg, Protein Pathways, Inc., University of California, Los Angeles
Functionating the proteomes of sequenced organisms is a great challenge in current computational biology. In contrast to homology-based methods, recent techniques infer cellular functions for proteins based on contextual and evolutionary information from whole proteomes. Here, we examine and illustrate the performance of these methods in functionating the proteome of Aquifex aeolicus.

51 - High-throughput SNP Mining of the Human Transcriptome
Raymond T. Yeh, Gabor T. Marth, Ian Korf, Warren R. Gish, Washington University School of Medicine
We have developed a pipeline that clusters ESTs from dbEST to finished and draft-quality human genomic sequence. We then mine these clusters for single-nucleotide polymorphisms using the POLYBAYES SNP discovery tool. Accurate clustering of ESTs allows reliable SNP detection free of false positives due to the existence of paralogs.