Poster Presentations Sorted by Number



Section A

A01:
Identifying miRNA by use of database structure

Subject: Databases & Ontologies

Presenting Author: Zach Abrams, Ohio University

Author(s):
Vijayanand Nadella, Ohio University
Sarah Wyatt, Ohio University
Harvey Ballard, Ohio University

Abstract:
Databases can help researchers evaluate vast amounts of information. The data used in this research were generated from whole-genome miRNA sequencing experiments. Total RNA was collected from chasmogamous and cleistogamous flowers of Viola pubescens. The small RNA population was isolated by PAGE gel electrophoresis and gel extraction. A small RNA library was then generated using the Illumina TruSeq small RNA preparation kit. The purified cDNA library was used for cluster generation on Illumina’s cluster station and then sequenced on an Illumina GAIIx instrument. Raw sequencing reads were obtained using Illumina’s Sequencing Control Studio software version 2.8 following real-time sequencing image analysis and base-calling by Illumina’s Real-Time Analysis version 1.8.70. The ~22 million filtered miRNA-mappable unique reads were then aligned to miRNAs in miRBase, mRNA, Rfam and Repbase. Only ~3.3 million reads were successfully aligned using the different databases and the genome sequence of Populus, Viola’s closest relative whose genome sequence is available. The normalized mapped miRNA read quantification from both the chasmogamous and cleistogamous flowers produced over 100 miRNAs with differential expression patterns. About 85% of the reads produced no hit, and the mapped reads have been classified into different groups based on their alignment to the databases. This data structure has presented a need for a MySQL database to properly store and mine the data, add features, and grow from future experiments. The data generated and stored in this database can be effectively used to identify new miRNAs from the evolving miRNA and other databases.


A02:
YouGenMap: your place to load, share and compare genome maps

Subject: Databases & Ontologies

Presenting Author: Keith Batesole, Miami University

Author(s):
Kokulapalan Wimalanathan, Miami University
Lin Liu, Miami University
Craig Echt, USDA Forest Service

Abstract:
Linkage maps help geneticists determine the order in which genes and genetic markers are arranged and the approximate distances among them, whereas physical maps pinpoint the exact position at which a gene or genetic marker is found in a chromosome. With the rapid accumulation of genomics data, there is growing demand for bioinformatics tools that biologists can easily use to examine, visualize, compare and consolidate their linkage maps or physical maps, which need constant improvement based on new data. Comparative genetic mapping between or within species allows examination of genome organization and evolution and helps transfer information from map-rich to map-poor species. Using state-of-the-art web technology including PHP and JavaScript, we have developed YouGenMap, an AJAX-based web service that allows biologists to load, share and compare their genome maps. With easy-to-use GUIs offering enhanced interaction and usability, YouGenMap is a genetic map viewer that lets users upload, display, download, update, share, and compare sets of mapping and marker annotation data. User data are uploaded and downloaded as Microsoft Excel files using the mapset template format provided. YouGenMap allows users to visualize multiple mapsets at a time, and has flexible options for displaying correspondences among maps. Correspondence lines between markers can be drawn even between two maps that are not adjacent. Moreover, clicking on a feature in a displayed map shows its annotations and map data. Users can selectively display desired features by applying a filter on feature types. With YouGenMap, genetic maps and their annotations become dynamic community assets.


A03:
OMERO.searcher: Content-based image search for microscope images

Subject: Databases & Ontologies

Presenting Author: Ivan Cao-Berg, Carnegie Mellon University

Author(s):
Jennifer Bakal, Carnegie Mellon University
Baek Hwan Cho, Carnegie Mellon University
Robert Murphy, Carnegie Mellon University

Abstract:
Fluorescence microscopy has grown dramatically over the past decade in both technical capabilities and the volume of images generated, allowing for the development of fast and effective means of searching records by context or content. Because databases usually contain context descriptors in the form of annotations, context-based searches are readily available in many database-driven tools. Similarly, content-based searches can use content annotations, such as GO terms, to retrieve images, but this approach may be limited by the “resolution” of such terms. In contrast, content-based image retrieval, also known as query-by-image, retrieves the images most similar to a query in terms of numerically computed features using a measure of similarity, thereby facilitating discovery of new patterns or similarities between known patterns. We present OMERO.searcher, a robust, flexible, open-source content-based image search tool for the computational biology community. Based on the Open Microscopy Environment, it implements a modified version of the FALCON algorithm using subcellular location features (SLFs) that have been stored in the database in the form of HDF5 files. Tests were performed on two distinct fluorescence microscopy databases. Classes of images with the same content annotations were created, and images were ranked by similarity to one or more query images drawn from one of those classes. Success was measured using the area under a receiver operating characteristic curve.
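
The ranking-and-scoring evaluation described in this abstract can be illustrated with a minimal sketch. It is not the modified FALCON algorithm: plain Euclidean distance between feature vectors stands in for the similarity measure, and the function names and toy feature values are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_similarity(query, database):
    """Return image ids ordered from most to least similar to the query.

    `database` maps an image id to its feature vector (an assumed layout).
    """
    return sorted(database, key=lambda image_id: euclidean(query, database[image_id]))


def roc_auc(ranked_ids, relevant):
    """Area under the ROC curve for a ranking.

    Equals the fraction of (relevant, irrelevant) pairs in which the
    relevant image is ranked ahead of the irrelevant one.
    """
    n_pos = sum(1 for image_id in ranked_ids if image_id in relevant)
    n_neg = len(ranked_ids) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one relevant and one irrelevant image")
    negatives_seen = 0
    misordered_pairs = 0
    for image_id in ranked_ids:
        if image_id in relevant:
            # Every irrelevant image already seen outranks this relevant one.
            misordered_pairs += negatives_seen
        else:
            negatives_seen += 1
    return 1.0 - misordered_pairs / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy feature vectors; real SLFs have many more dimensions.
    features = {"a": [0.0, 0.0], "b": [0.1, 0.0], "c": [1.0, 1.0], "d": [0.9, 1.2]}
    ranking = rank_by_similarity([0.0, 0.1], features)
    print(ranking)
    print(roc_auc(ranking, relevant={"a", "b"}))
```

An AUC of 1.0 means every image sharing the query's annotation class was ranked ahead of every image from another class; 0.5 corresponds to a random ordering.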


A04:
DominoQuery, a research-friendly deployable query environment

Subject: Databases & Ontologies

Presenting Author: Rajesh Cherukuri, Case Western Reserve University

Author(s):
Joe Teagno, Case Western Reserve University

Abstract:
We introduce DominoQuery, a research tool that takes well-structured XML documents capturing the details of a clinical study and allows a researcher to query the data set quickly, efficiently, and accurately. DominoQuery achieves these goals by using a data store with text-mining capabilities, dubbed DominoStore, and a web-based query tool we call XQRuby.

DominoStore was developed to enable researchers to easily access and manage their document store. It allows users to search their data with ease and does not require prior experience in managing or using a database. DominoStore is a Java application that lets the user add documents to the document store via a graphical user interface; it then automatically indexes each document and runs kernel-based cluster text mining to determine likely queries, in order to speed up the user's subsequent queries. DominoStore has one drawback: though the user can enter a query, the result is an XML blob.

The query tool employs XQRuby, a Domain-Specific Language (DSL) created for DominoQuery, which allows the query interface to connect to the document store. XQRuby serves as a bridge between the application's native code, written in Ruby, and the queries into the DominoStore, executed in XQuery. XQRuby shoulders all the responsibility for coordinating incoming queries from the Ruby code and leveraging the caching and acceleration enhancements provided by the DominoStore. Overall, DominoQuery enables a researcher to forgo any tedious software configuration and parse through their data efficiently.


A05:
MedSwine: A Browser and Genome Portal for Swine in BioMedicine

Subject: Databases & Ontologies

Presenting Author: John Garbe, University of Minnesota

Author(s):
Bin Zang, University of Minnesota
Chris Hackett, University of Minnesota
Deepak Reyon, Iowa State University
Aymeric Duclert, Cellectis, Inc
Christophe Delenda, Cellectis, Inc
Scott Fahrenkrug, University of Minnesota

Abstract:
The conservation of gene-function and physiology between people and pigs advocates for the use of swine for modelling human disease. Furthermore, large litters and advanced reproductive and genome-engineering technologies in swine provide efficient methods for the creation of large-animal models for biomedical and pharmacological research. The release of the Sscr10.2 swine genome build provides a valuable resource for guiding the engineering of the swine genome for the creation of these models.
We have used sequence similarity to identify potential target porcine genes by annotation transfer from the Online Mendelian Inheritance in Man, Human Phenotype Ontology database, Mouse and Rat Genome Databases, as well as the Online Mendelian Inheritance in Animals. This annotation transfer is integrated with data on putative target sites for genome engineering tools. The MedSwine portal provides two interfaces for querying these annotations: a GBrowse genome browser and a web search interface that permits searching for genes (by name or ID) or ontology (by ID or term). We demonstrate the utility of this resource by identifying intersections between swine genes, human disease genes, and putative target sites for genome engineering tools. The ability to examine these sites en masse and in the context of gene structure will serve as an invaluable tool for engineering the swine genome for both biomedical and agricultural applications. The MedSwine browser and portal is publicly available at http://vector.cfans.umn.edu/MedSwine/.


A06:
Evaluating Predictive Drug Biomarkers Across 31,000+ Clinical Samples

Subject: Databases & Ontologies

Presenting Author: Ross Patterson, Compendia Bioscience

Author(s):
Sarah Anstead, Compendia Bioscience
Chris Gates, Compendia Bioscience
Mark Tomilo, Compendia Bioscience
Peter Wyngaard, Compendia Bioscience
Dan Rhodes, Compendia Bioscience

Abstract:
Associating predictive biomarkers with specific cancer subtypes allows the patient populations to be stratified for efficient clinical trials. This is difficult to do with microarray data because while many datasets measure the genomic attributes of these cancer subtypes, the subtype sample size is often insufficient to reach desired statistical significance. Furthermore, the samples cannot be directly compared across datasets due to the inconsistency of sample metadata reporting and experimental batch effects. To address these challenges, we present an approach to normalize individual datasets so that they can be combined into an integrated meta-dataset that contains sample subsets large enough to make confident associations between biomarkers and disease subtypes. Combining 337 datasets in Oncomine™ measured on specific platforms, we compiled a dataset that represents over 31,000 distinct samples. The experimental data was normalized across datasets using quantile normalization to reduce batch effects. The associated sample metadata was curated applying Oncomine™ annotation and ontology standards.
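
Quantile normalization, the step named above for reducing batch effects, can be sketched as follows. This is a minimal illustration that assumes every sample measures the same features and ignores tie handling; the function name and data layout are hypothetical and are not taken from the Oncomine™ pipeline.

```python
def quantile_normalize(samples):
    """Force every sample to share the same value distribution.

    `samples` maps a sample name to a list of measurements, one per
    feature, with the same feature order in every sample.
    """
    names = list(samples)
    n = len(samples[names[0]])
    if any(len(samples[name]) != n for name in names):
        raise ValueError("all samples must measure the same features")

    # Reference distribution: mean of the r-th smallest value across samples.
    sorted_columns = [sorted(samples[name]) for name in names]
    reference = [sum(col[r] for col in sorted_columns) / len(names) for r in range(n)]

    normalized = {}
    for name in names:
        values = samples[name]
        # Feature indices ordered from smallest to largest measurement.
        order = sorted(range(n), key=lambda i: values[i])
        result = [0.0] * n
        for rank, index in enumerate(order):
            result[index] = reference[rank]
        normalized[name] = result
    return normalized


if __name__ == "__main__":
    raw = {"sample_1": [5, 2, 3], "sample_2": [4, 1, 6]}
    print(quantile_normalize(raw))
    # {'sample_1': [5.5, 1.5, 3.5], 'sample_2': [3.5, 1.5, 5.5]}
```

After normalization each sample keeps its own feature ranking but takes its values from the shared reference distribution, which is what makes samples from different batches comparable.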

Using this integrated dataset we identified a refined patient population for a MEK inhibitor. As expected based on the literature and previous clinical experience, subsets of melanoma, pancreatic, and colorectal cancers were identified as the best candidate populations. Further analysis showed that the prevalence of a sensitivity signature is highest in primary colorectal cancer and lower in metastatic disease. Late-stage, metastatic patients are usually selected in early clinical trials, and this finding may indicate why trials for MEK inhibitors in colorectal cancers have been largely unsuccessful, while results in melanoma and pancreatic cancer have been more promising.


A07:
Discovery of Novel Cancer Genes through Application of Clinical Metadata Ontology

Subject: Databases & Ontologies

Presenting Author: Jeff Bonevich, Compendia Bioscience

Author(s):
Rachel Dull, Compendia Bioscience
Chris Gates, Compendia Bioscience
Becky Steck, Compendia Bioscience
Peter Wyngaard, Compendia Bioscience

Abstract:
The full value of genomic data in the discovery of novel cancer genes can only be realized by looking to clinical metadata for meaningful explanations. These metadata vary widely in detail, availability, and consistency, and the lack of accepted standards in their collection limits the possibility for scalable analysis. Though standards exist for the submission of genomic data (e.g. MIAME), these standards often omit the submission of clinical metadata (e.g. demography, pathology, treatment). Collections of these metadata are highly inconsistent and loosely defined. Creating a clinical metadata controlled vocabulary enables more effective querying of the genomic data and faster discovery of the genes driving tumors.

Large-scale analyses of the genomic data, leading to the discovery of significant drug targets, often involve comparisons across multiple datasets. Such comparisons become possible with more stringent controls on clinical metadata terms. To enable comparability, Compendia Bioscience has developed an ontology of oncology-focused terms. The Compendia Ontology includes a controlled vocabulary capturing common information on the sample, patient, and dataset level, allowing them to be searched and filtered within Oncomine™, our global collection of genomic cancer data. These data include hierarchical relationships between terms. Automated rules based on set points within the hierarchy define sample groups used to compute meaningful analyses on the genomic data. These analyses have contributed directly to the discovery of novel cancer genes.


A08:
LRpath analysis with clustering functionality reveals common pathways dysregulated via DNA methylation across cancer types

Subject: Databases & Ontologies

Presenting Author: Jung Kim, University of Michigan

Author(s):
Alla Karnovsky, University of Michigan
Vasudeva Mahavisno, University of Michigan
Terry Weymouth, University of Michigan
Dana Dolinoy, University of Michigan
Laura Rozek, University of Michigan
Maureen Sartor, University of Michigan

Abstract:
We developed a web-based gene set enrichment application, called LRpath, with clustering functionality allowing for the identification and comparison of biological concept signatures across multiple published studies. LRpath uses an internal annotation database that contains a wide variety of gene sets representing several types of biological knowledge, such as functional annotations, literature derived concepts, target sets, interactions, metabolite-centered concepts and chromosomal location across multiple species. Advantages of the LRpath web application include powerful performance with both small and large sample sizes, the ability to search against > 20,000 predefined concepts from 16 different annotation databases, and options to set various analysis parameters, perform two different types of test, and visualize the enrichment profiles across multiple experiments. The undirectional test in LRpath distinguishes between enriched and depleted concepts, while the directional test identifies concepts enriched with genes that are up- or down-regulated. The clustering analysis integrates the enrichment results to interactively view and explore results across experiments.
We illustrate the use of LRpath using ten cancer versus normal studies of DNA methylation profiled with the Illumina HumanMethylation27 BeadChip. We identify several cancer-related pathways significantly affected by differential methylation across multiple cancer types. Commonly hypomethylated concepts include immune-related functions, peptidase activity, and epidermis/keratinocyte development and differentiation. Commonly hypermethylated concepts encompass transcription factors, nervous system and embryonic development, and voltage-gated potassium channels. Interestingly, fewer DNA repair genes are differentially methylated than expected by chance. However, despite the overall depletion, a few key regulator genes such as MGMT are still differentially methylated.


A09:
Pedigree Query, Visualization, and Genetic Calculations Tool

Subject: Databases & Ontologies

Presenting Author: Murat Kurtcephe, Case Western Reserve University

Author(s):
En Cheng, Case Western Reserve University
Z. Meral Ozsoyoglu, Case Western Reserve University

Abstract:
Family trees, a.k.a. pedigrees, are becoming increasingly important in human genetics because they can be used to trace a genetic disorder or trait, to calculate disease risks, and to facilitate genetic counseling. Pedigrees can be represented as directed acyclic graphs, where nodes represent individuals and edges represent parent-child relationships. In this study, we present a new system for pedigree query, visualization, and genetic calculations. A novel query interface is proposed in which users can form complicated queries via an easy-to-use graphical user interface, with no need for any knowledge of a high-level query language such as SQL or XPath. Pedigree data can be queried by conditions based on the pedigree structure, by attributes of individual entries in the pedigree, or both. A graph encoding method called NodeCodes enables our system to efficiently evaluate relationship-based queries without traversing the graph or using recursive query calls. Visualizing the pedigree data as a dynamic drawing allows users to analyze the query results in a more understandable form. Users can also interact with the visualization to select individuals for genetic calculations or to perform simple relationship-based queries. The system also provides genetic calculations including inbreeding, kinship, and identity coefficients. Instead of using traditional recursive methods, our system performs pedigree calculations by using path-based formulas coupled with NodeCodes to achieve efficiency and scalability.
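
For context, the traditional recursive calculation that the abstract contrasts with its path-based approach can be sketched as below. This is the textbook kinship recursion, not the NodeCodes method; the pedigree layout (a dict mapping each individual to a father/mother pair) and all names are assumptions made for illustration.

```python
from functools import lru_cache


def make_kinship(pedigree):
    """Build a memoized kinship function for a pedigree.

    `pedigree` maps each individual to (father, mother); founders map to
    (None, None). Unknown parents are treated as unrelated to everyone.
    """

    @lru_cache(maxsize=None)
    def depth(person):
        # Founders have depth 0; otherwise one more than the deepest parent.
        parents = [p for p in pedigree[person] if p is not None]
        return 1 + max(depth(p) for p in parents) if parents else 0

    @lru_cache(maxsize=None)
    def kinship(a, b):
        if a is None or b is None:
            return 0.0
        if a == b:
            father, mother = pedigree[a]
            return 0.5 * (1.0 + kinship(father, mother))
        # Expand the deeper individual, who cannot be an ancestor of the other.
        if depth(a) < depth(b):
            a, b = b, a
        father, mother = pedigree[a]
        return 0.5 * (kinship(father, b) + kinship(mother, b))

    return kinship


def inbreeding(kinship, pedigree, person):
    """Inbreeding coefficient: the kinship between the person's parents."""
    father, mother = pedigree[person]
    return kinship(father, mother)


if __name__ == "__main__":
    # Two unrelated founders, their two children, and a grandchild
    # produced by a mating between those full siblings.
    pedigree = {
        "F": (None, None),
        "M": (None, None),
        "s1": ("F", "M"),
        "s2": ("F", "M"),
        "child": ("s1", "s2"),
    }
    kinship = make_kinship(pedigree)
    print(kinship("s1", "s2"))                    # 0.25
    print(inbreeding(kinship, pedigree, "child"))  # 0.25
```

Even with memoization, this recursion walks up the pedigree through both parents of each individual, which is the kind of graph traversal the abstract says the NodeCodes encoding is designed to avoid.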


A10:
Echinobank: Identifying alignable orthologs from next generation transcriptome data

Subject: Databases & Ontologies

Presenting Author: Calvin Lam, The Ohio State University

Author(s):
Jacob Aaronson, The Ohio State University
Daniel Janies, The Ohio State University

Abstract:
The goal of Echinobank (http://echninobank.osu.edu) is to identify alignable orthologs for phylogenetic analysis from next-generation transcriptome data for echinoderms. 454 runs were prepared by extraction of RNA via the RNeasy minikit, amplification of RNA via Invitrogen's SuperScript RNA Amplification System, and cDNA library preparation via Roche’s cDNA Rapid Library Prep. The 454 sequence reads were assembled using NEWBLER version 2.5.3. We created a workflow that takes assembled 454 sequence data organized into isogroups and uses the longest isotig as the representative sequence of each isogroup. Each isogroup is compared (via BLASTN) against the reference sequence database (RefSeq Release 50 at the National Center for Biotechnology Information; NCBI). The outputs are precomputed high-scoring pairs for that specific isogroup sequence. Echinobank stores the similarity scores and annotation from high-scoring pairs and displays them via a web-based application. A test dataset of transcriptome data consisting of six taxa was loaded into the database and preliminary results were gathered. When compared against the entire reference sequence database, there were: 2045 reference sequence hits across 1 taxon, 479 hits across 2 taxa, 639 hits across 3 taxa, 7504 hits across 4 taxa, and 224 hits across 5 taxa. When compared against only the sea urchin reference sequences, there were: 401 hits across 1 taxon, 479 hits across 2 taxa, 639 hits across 3 taxa, 7504 hits across 4 taxa, and 224 hits across 5 taxa.
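
Two steps of the workflow described in this abstract lend themselves to a short sketch: choosing the longest isotig as the representative of each isogroup, and tallying how many taxa hit each reference sequence. The record layouts, identifiers, and function names below are assumptions for illustration, not Echinobank's actual code.

```python
from collections import Counter, defaultdict


def representative_isotigs(isotigs):
    """Pick the longest isotig of each isogroup.

    `isotigs` is an iterable of (isogroup_id, isotig_id, sequence) tuples.
    Returns a dict mapping isogroup_id to (isotig_id, sequence).
    """
    best = {}
    for isogroup_id, isotig_id, sequence in isotigs:
        current = best.get(isogroup_id)
        if current is None or len(sequence) > len(current[1]):
            best[isogroup_id] = (isotig_id, sequence)
    return best


def hits_by_taxon_count(hits):
    """Count reference sequences by the number of distinct taxa that hit them.

    `hits` is an iterable of (taxon, reference_id) pairs, for example one
    pair per BLASTN high-scoring pair. Returns a Counter mapping a number
    of taxa to the number of reference sequences hit by that many taxa.
    """
    taxa_per_reference = defaultdict(set)
    for taxon, reference_id in hits:
        taxa_per_reference[reference_id].add(taxon)
    return Counter(len(taxa) for taxa in taxa_per_reference.values())


if __name__ == "__main__":
    isotigs = [
        ("isogroup00001", "isotig00001", "ACGT"),
        ("isogroup00001", "isotig00002", "ACGTACGT"),
        ("isogroup00002", "isotig00003", "GG"),
    ]
    for isogroup_id, (isotig_id, sequence) in representative_isotigs(isotigs).items():
        # FASTA records like these could serve as BLASTN queries.
        print(f">{isogroup_id} {isotig_id}\n{sequence}")
```

A tally like the one returned by `hits_by_taxon_count` is one way to produce "hits across N taxa" summaries of the kind reported in this abstract.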


A11:
Comparison of free web-based data capture solutions for international clinical research

Subject: Databases & Ontologies

Presenting Author: Andy Lin, University of Michigan

Abstract:
Capturing patient and research data is a key task in clinical research. Enterprise-level solutions are often very expensive and/or difficult to support. Low-cost, lightweight solutions that can address the key requirements of most clinical research projects are highly desirable. We believe an ideal solution should meet the following criteria: 1) web-based, allowing access from multiple sites; 2) support for multiple languages; 3) low or no cost, and preferably open-source. We compared three free web-based data capture solutions that meet these requirements: LimeSurvey, REDCap and OpenClinica.

All three solutions are designed primarily for data capture, not reporting or analysis. LimeSurvey has the best language support, providing translations for over 50 languages. However, it lacks study management tools. REDCap is designed for clinical research, so it supports audit tracking and handling of personal identifiers. It also provides a project calendar for longitudinal studies. It is limited in the number of question types it supports and currently provides only a Chinese translation. OpenClinica is designed for clinical trials in multiple sites and supports multiple languages. It provides extensive study role management capabilities. It also supports audit tracking and a discrepancy notes system for data monitoring.

We experimented with using all three solutions for our pilot study. OpenClinica was selected because of its ability to handle multiple sites and different levels of user access, and the discrepancy notes system. Report functionality is handled with custom reports, tailored to individual studies, and a project calendar is being developed.


A12:
Transferring Biology Data into Graph Model: A Comprehensive Graphic User Interface to Convert, Manage and Represent Biological Networks in Graph Database

Subject: Databases & Ontologies

Presenting Author: Yunkai Liu, Gannon University

Author(s):
Lei Zhang, Gannon University

Abstract:
Interactions between biological entities, whether cells, proteins or genes, are under meticulous study. Large amounts of bio-network data, which are naturally represented in a graph model, have been produced. Managing these data is difficult because of the restrictions of relational databases and the special queries that modeling requires. Recently, some ground-breaking improvements in graph database technology have allowed us to deal with large bio-network data at the database level instead of the application level. Thus, a user-friendly interface integrating powerful functionality is a necessity. We developed a comprehensive graphical user interface (GUI) to convert, manage and represent bio-network data in a graph database, Neo4j. The interface provides three kinds of functionality: conversion of different datasets, efficient data management and large-graph visualization. The interface can convert data from major online databases. Users can also import or export their own databases. Multiple query functions have been developed to manage existing databases. Some of these queries simulate the SELECT and Data Manipulation Language (DML) statements of Structured Query Language (SQL). Special NoSQL queries are also provided to fit graph modeling requirements. A visual panel enables users to directly observe and manage networks with up to 10,000 nodes. Layout management algorithms are applied to sort and represent large networks. User-friendly features, such as drag-and-drop, are included in the panel. The interface is designed for users of embedded databases and can easily be extended into a handy tool for administering server-based graph databases.


A13:
Development of PlantSecKB: an Integrated Plant Secretome Knowledge-Base

Subject: Databases & Ontologies

Presenting Author: Xiang-Jia Min, Youngstown State University

Author(s):
Gengkon Lum, Youngstown State University

Abstract:
PlantSecKB (http://proteomics.ysu.edu/secretomes/plant.html) provides a resource of all secreted proteins, i.e. secretomes, for all plants. The database was constructed with all the available plant protein data from the UniProt database and predicted plant protein sequences from EST data assembled by the PlantGDB project (http://www.plantgdb.org/prj/ESTCluster/). The database contains information from three sources: (1) information generated using a computational protocol including SignalP, TMHMM, TargetP, Phobius and PS-Scan; (2) annotation of subcellular locations that were manually curated or computationally predicted in the UniProt database; (3) subcellular locations that were manually curated by our curators from recent literature. With a web-based user interface, the database is searchable, browsable, and downloadable by using UniProt accession number, NCBI GI, RefSeq accession number, key words, and species. A BLAST utility was integrated to allow users to query the database based on sequence similarity to protein sequences of their interest. A tool was also implemented to support community annotation for subcellular locations of plant proteins. With the complete data available for plants and associated web-based tools, PlantSecKB will be a valuable resource for exploring the potential applications of plant secreted proteins. This work is supported by the Ohio Plant Biotechnology Consortium.


A14:
Graph by example: a semi-structure query language over graph-based RDF database

Subject: Databases & Ontologies

Presenting Author: Shi Qiao, Case Western Reserve University

Author(s):
Lei Yang, Case Western Reserve University

Abstract:
With web-based applications growing at a fast pace, more and more large RDF databases are becoming available for users to access without knowledge of any data processing language. The current trend of storing RDF data as a large graph shows extraordinary benefits compared with triple-based storage. Recent query languages over graph-based RDF data fall into two categories: keyword-based query languages without graph structure (the Steiner tree problem) and strict graph-structure-based query languages (SPARQL). However, neither category is suitable for inexperienced users, who either have no control over the graph structure or must have full knowledge of it. In this paper, we propose a new extensible semi-structure query language, graph by example, that supports specifying partial graph structure information within a keyword-based graph query language. We first compare our technique with other query languages theoretically to show the advantages of using graph by example. Second, we propose a new indexing structure based on an extension of neighborhood signatures and graph compression to better support graph by example queries over extremely large RDF databases. To test the performance of graph by example, we run experiments on both synthetic and real RDF benchmarks. The results confirm the feasibility and flexibility of query by example with our new indexing structure.


A15:
VirmugenDB: A database and analysis system of virulence factors whose mutants can be used as live attenuated vaccines

Subject: Databases & Ontologies

Presenting Author: Rebecca Racz, University of Michigan

Author(s):
Monica Chung, University of Michigan
Zuoshuang Xiang, University of Michigan
Yongqun He, University of Michigan

Abstract:
Vaccination is one of the most powerful methods of preventing and fighting infectious disease. One method of vaccine development is the generation of live attenuated vaccines by mutation of genes encoding virulence factors. "Virmugen" is coined here to denote a gene that encodes a virulence factor of a pathogen and whose knockout has been proven feasible for making a live attenuated vaccine. Not all virulence factors can be used for vaccine development. While numerous studies have reported the development of live attenuated vaccines, a systematic analysis of virmugens has not been performed. In this study, a web-based database, VirmugenDB, has been generated (http://www.violinet.org/virmugendb). Currently, VirmugenDB includes over 220 virmugens that have been verified to be valuable for vaccine development against over 55 pathogens. Bioinformatics analysis has revealed significant patterns in mutated genes for both bacteria and viruses. For example, 10 Gram-negative and 1 Gram-positive bacterial aroA genes are virmugens. A sequence analysis has revealed at least 50% identity among the protein sequences of the 10 Gram-negative bacterial aroA virmugens. As a pathogen case study, Brucella virmugens were analyzed. Out of 15 Brucella virmugens in this study, six are related to carbohydrate or nucleotide transport and metabolism, and two are involved in cell wall biogenesis. More patterns have been identified with COG analysis and will be reported. The bioinformatics annotation and analysis of virmugens helps elucidate the mechanisms of microbial pathogenesis and host immunity and further supports rational design of future live attenuated vaccines.


A16:
Ca-MIMI: MIMI System for Case Comprehensive Cancer Center

Subject: Databases & Ontologies

Presenting Author: Shiqiang Tao, Case Western Reserve University

Author(s):
Guo-Qiang Zhang, Case Western Reserve University
Sri Cherukuri, Case Western Reserve University
Licong Cui, Case Western Reserve University

Abstract:
MIMI (Multi-modality Multi-resource Information Integration System) integrates administrative and scientific functions of a core facility and captures a complete, essential set of raw information about people (researchers and staff members), projects, resources, materials, experimental workflow, accounting workflow, scientific data, and resource scheduling. Three notable features of MIMI are (1) a friendly and uniform web-based interface to make access to the system unconstrained by time, space, or types of computer system; (2) autonomous operation with minimal overhead for data entry and system administration support, achieved by decentralized content management and the integration of experimental workflow management with an automated scheduling program; (3) broad applicability to centers and cores with different sizes and scopes by a thorough analysis of data management needs at a variety of facilities with a common set of entities and center-level workflow, and a flexible and expandable implementation.

The Case Comprehensive Cancer Center (Case CCC) based at Case Western Reserve University is a partnership organization supporting all cancer related research efforts at CWRU, University Hospitals Case Medical Center, and the Cleveland Clinic.

Ca-MIMI is a successful deployment of the MIMI system for facility resource management of Case CCC shared resources. It manages the whole life cycle of facility usage: scheduling, execution, invoicing, and reporting. This poster briefly describes the strategies by which Ca-MIMI fulfills its scientific and administrative functions.


A17:
The Rat Genome Database: Challenges for Data Loading Pipelines

Subject: Databases & Ontologies

Presenting Author: Marek Tutaj, Medical College of Wisconsin

Author(s):
Elizabeth Worthey, Medical College of Wisconsin
Mary Shimoyama, Medical College of Wisconsin
Howard Jacob, Medical College of Wisconsin
Jennifer Smith, Medical College of Wisconsin

Abstract:
The Rat Genome Database strives to present a variety of genomic and phenotypic data from multiple perspectives, in order to be useful for individuals engaged in basic, clinical, and translational research. A combination of targeted literature curation and a network of automated pipelines provides comprehensive functional coverage of the genome. Through our automated pipelines, RGD integrates data on genomic elements from multiple sources with a variety of functional data from multiple species. Orthologs and mappings are created for rat, human and mouse. Genes requiring specific nomenclature review are tagged for review, with provisional nomenclature provided. Multiple ontologies are incorporated; obsoleted terms and annotations are identified. Experimentally determined human and mouse ortholog Gene Ontology annotations for rat genes are provided. Identifiers, data, and links for major protein databases are also added and updated regularly. Particular care is taken to incorporate high-quality annotations; stringent quality control in the pipelines identifies conflicts, omissions and questionable relationships among data originating at other sources as well as with data already in RGD. Conflict reports are automatically sent to curators for resolution. Web pages with detailed pipeline logs assist curators in managing these tasks. An in-house Java-based pipeline framework allows efficient development of new pipelines that take advantage of multiple-core machines and are suitable for processing bulk XML data. Different data loading strategies are used, the most common being drop-and-reload and incremental updates. This sophisticated data pipeline network allows RGD to provide comprehensive genome-wide functional and structural information that is tightly regulated yet flexible and up-to-date.
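
The two loading strategies named in this abstract, drop-and-reload and incremental updates, can be contrasted with a small sketch. It uses Python with an in-memory SQLite database rather than RGD's Java framework, and the `gene_xref` table, its columns, and the function names are hypothetical.

```python
import sqlite3

INSERT_SQL = "INSERT INTO gene_xref (gene_id, xref) VALUES (?, ?)"
DELETE_SQL = "DELETE FROM gene_xref WHERE gene_id = ? AND xref = ?"


def create_table(conn):
    """Create the hypothetical cross-reference table used in this sketch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gene_xref ("
        "gene_id INTEGER NOT NULL, xref TEXT NOT NULL, "
        "PRIMARY KEY (gene_id, xref))"
    )


def drop_and_reload(conn, records):
    """Discard every existing row, then load the incoming data set in full."""
    with conn:  # one transaction: readers never see a half-empty table
        conn.execute("DELETE FROM gene_xref")
        conn.executemany(INSERT_SQL, records)


def incremental_update(conn, records):
    """Apply only the differences between the incoming data and the table.

    `records` is a list of (gene_id, xref) tuples. Returns the number of
    rows inserted and deleted.
    """
    incoming = set(records)
    existing = set(conn.execute("SELECT gene_id, xref FROM gene_xref"))
    to_insert = sorted(incoming - existing)
    to_delete = sorted(existing - incoming)
    with conn:
        conn.executemany(INSERT_SQL, to_insert)
        conn.executemany(DELETE_SQL, to_delete)
    return len(to_insert), len(to_delete)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_table(conn)
    drop_and_reload(conn, [(1, "A"), (2, "B")])
    print(incremental_update(conn, [(2, "B"), (3, "C")]))  # (1, 1)
```

Drop-and-reload is simpler and guarantees the table mirrors the source, while an incremental update touches only changed rows and yields insert and delete counts that can be written to a pipeline log.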


A18:
A tool for the standardized storage, annotation, and presentation of individual whole genome variation

Subject: Databases & Ontologies

Presenting Author: Brandon Wilk, Medical College of Wisconsin

Author(s):
Samual Flynn, Medical College of Wisconsin
George Kowalski, Medical College of Wisconsin
Jeremy Harris, Medical College of Wisconsin
Mary Shimoyama, Medical College of Wisconsin
Elizabeth Worthey, Medical College of Wisconsin

Abstract:
The price of whole genome sequencing has decreased drastically since the emergence of high-throughput sequencing technologies, leading to rapidly increasing numbers of published whole genome or exome sequences. Individual publications and larger consortia have released many individual human genomes to the public domain. However, it is often difficult to find and compile these individual human genomes into a single repository due to large file sizes, different file formats, and different methods of calling and denoting genomic variations. The datasets themselves are often dispersed throughout many sites, which store and distribute the files in different ways. The Variant Annotation, Listing and Classification Repository with Interface Environment (VALCRIE) system provides a central location for the storage and annotation of variants using a defined schema. It integrates this structure with a program capable of loading several common variant annotation files (including GFF, GFF3, GVF, VCF, CompleteGenomics TSV) systematically into the repository. VALCRIE also provides a simple User Interface (UI) to query and visualize genomic variants across multiple genomes. VALCRIE will allow users to quickly and easily carry out cross-genome analyses using efficiently and consistently stored genomic variant datasets. Researchers will be able to set up VALCRIE locally and load preprocessed and formatted files (made available at our site). VALCRIE will also continue to support the community of users with frequent updates of software, schemas and additional genomes. The system, available data, data formats, and analysis capabilities will be presented.


A19:
Mining and Annotation of Gene Lists: A comparative study

Subject: Databases & Ontologies

Presenting Author: Sean Fenstemaker, Ohio University

Author(s):
Zachary Abrams, Ohio University
Mason Armbruster, Ohio University
Shannon Clay, Ohio University
Kristine Garcia, Ohio University
Marilyn Hayden, Ohio University
Travis Johnson, Ohio University
Kaysi Lyall, Ohio University
William Presley, Ohio University
Olivia Thompson, Ohio University
Daniel Williams, Ohio University
Timothy Williams, Ohio University
Sarah Wyatt, Ohio University

Abstract:
High-throughput DNA and protein technologies generate extensive gene lists. Bioinformatics tools are needed to help researchers reduce and organize these lists and better identify genes of interest. The focus of this project was to compare the following five tools: AraCyc, DAVID, DEFOG, GOrilla, and STRING. Each program was used to analyze a list of 346 differentially expressed genes generated via microarray analysis. These programs utilized a variety of parameters appropriate for each statistical analysis. The results were then analyzed to determine the type of output, the variety of analyses that could be performed, and the visual display of the data generated. These programs have different features, so it is crucial to be able to determine which programs are appropriate for different research methods or research goals. The objective was to give potential users a better understanding of the usefulness, benefits, and drawbacks of each tool. Special thanks to the Choose Ohio First for Bioinformatics Scholarship Program at Ohio University for supporting this work.


A20:
Comparative Analysis of Intrinsic Disorder in Arenaviruses Proteins Using Bioinformatic Prediction

Subject: Evolution & Comparative Genomics

Presenting Author: Jonathon Combs, University of Findlay

Author(s):
Shawn Warner, University of Findlay

Abstract:
Intrinsically disordered protein regions (IDRs), defined as regions of high conformational flexibility and a lack of stable secondary structure, are often involved in vital biological interactions. This study compares bioinformatic predictions of intrinsic disorder among different species of the viral genus Arenavirus. This genus contains many species endemic in South America and West Africa, and represents an important group of human pathogens. Prediction of IDRs from public sequence data was performed using the PONDR® VL-XT algorithm. Statistical analyses were performed to test for significant differences in IDRs among species. Arenavirus proteins were shown to have differential levels of disorder between species. Such differential levels of intrinsic disorder have previously been shown to correlate with characteristics such as virulence and immune evasion, which appears to be the case in arenavirus species.


A21:
Metabolic Network Analysis of Apicomplexan Parasites to Identify Novel Drug Targets

Subject: Evolution & Comparative Genomics

Presenting Author: Stacy Hung, University of Toronto

Author(s):
James Wasmuth, Hospital for Sick Children
Michael Grigg, National Institutes of Health
John Parkinson, University of Toronto

Abstract:
We are interested in studying the metabolic network of apicomplexan parasites, which include Plasmodium falciparum, the causative agent of the most severe forms of malaria, and Toxoplasma gondii, which is responsible for food-borne illnesses that are health threats in HIV+/AIDS and immunocompromised populations. By applying systems-based methods in the context of biochemical pathways, we can better understand the metabolic potential of apicomplexans, enabling the identification of viable enzyme drug targets. We have accurately reconstructed the networks for 14 apicomplexans, and comparative analyses have confirmed the presence of a highly conserved ‘core’ of enzymes along with those that are lineage-specific, suggesting these parasites have evolved different strategies for performing similar metabolic activities. Furthermore, candidate enzymes of therapeutic interest have been identified in the pantothenate biosynthesis pathway, which we are characterizing through gene knockout studies in Toxoplasma. To examine the in vivo landscapes of parasite metabolism, we have obtained high-quality RNA-Seq datasets, providing deep coverage of blood stages for P. falciparum and metabolically active stages of Toxoplasma and the closely related Neospora. By overlaying expression data onto the network, we can apply comparative transcriptomics to highlight conserved expression patterns and differentially expressed pathways that might explain the ability of these parasites to survive in such a wide range of hosts. These findings provide insight into metabolic adaptations of apicomplexans, and with an improved metabolic reconstruction for apicomplexans, we believe more meaningful system-based studies can be performed that serve to generate real, testable hypotheses to help focus future drug-discovery programs.


A22:
Variation through Sex within Viral Species

Subject: Evolution & Comparative Genomics

Presenting Author: Alex Kula, Loyola University Chicago

Author(s):
Zachary Romer, Loyola University Chicago
Catherine Putonti, Loyola University Chicago

Abstract:
The reassortment of segments in RNA viruses has proved to be a common pathway in the change of viruses. Various reassortment-modeling techniques have been shown to be innovative in predicting certain RNA reassortment patterns. While most models have been developed for reassortment events in viruses infecting humans, reassortment does occur within viruses infecting other animals, plants and bacteria. Due to the different lifestyles of the hosts, different parameters must be considered. In an effort to better understand the role of reassortment within the RNA-based bacteriophages, a model was developed to simulate reassortment within these viruses and their bacterial host(s). Using this model, we have conducted a series of simulations which, in conjunction with empirical work, elucidate reassortment within phage.


A23:
Genetic diversity of Brucella species revealed by simple sequence repeat markers

Subject: Evolution & Comparative Genomics

Presenting Author: Wenxiao Liu, University of Michigan

Author(s):
Qingmin Wu, China Agricultural University
Yongqun He, University of Michigan

Abstract:
Brucella is a facultative, Gram-negative bacterium that causes zoonotic brucellosis in humans and a variety of animals. Genome-wide screening of DNA sequences of Brucella strains revealed tens of thousands of simple sequence repeats (SSRs) with motif lengths of 1-6 bp. SSRs influence the virulence and host adaptation of pathogenic bacteria. In this study, the genomes of three B. melitensis strains (NI, 16M, ATCC 23457) and two B. abortus strains (2308, 9-941) were analyzed for the frequency and abundance of SSRs using a simple string-matching algorithm. Our analysis revealed that SSRs are distributed throughout Brucella genomes with a varied tract density (i.e., the number of SSRs/10 kb). The tract density of B. abortus strain 2308 is 471, while the other four Brucella strains average 440. Six types of SSRs (monomer, dimer, trimer, tetramer, pentamer and hexamer) were observed in the Brucella genomes. Thirteen percent of SSRs are located in the noncoding regions of NI and 16M. The other three strains have 17% of SSRs in their noncoding regions. Over 85% of SSRs in the coding region belong to di, tetra, and penta motifs. These SSRs may cause frameshift mutations in Brucella genes. Among the five genomes, 71% of the nucleotides of mono- and dinucleotide SSRs are composed of A’s or T’s. As a genome-wide variation, SSR polymorphism could prove to be an important part of the evolutionary ecology of microbes.
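As an illustration of the kind of string matching and tract-density calculation described in this abstract, the following is a minimal Python sketch. It is not the authors' implementation: the function names, the minimum repeat count of 3, and the restriction to perfect repeats are all assumptions made for the example, since the abstract does not state its thresholds.

```python
import re

def find_ssrs(seq, min_repeats=3, max_motif=6):
    """Find perfect tandem repeats of 1-6 bp motifs.

    Returns a list of (start, motif, repeat_count) tuples.
    min_repeats is an assumed threshold, not taken from the abstract.
    """
    seq = seq.upper()
    # Lazy motif match, so the shortest repeating unit is reported
    # (e.g. "AAAAAA" is a run of "A", not of "AA").
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]

def tract_density(seq, **kwargs):
    """Number of SSRs per 10 kb of sequence."""
    return len(find_ssrs(seq, **kwargs)) * 10_000 / len(seq)

if __name__ == "__main__":
    toy = "GGCATATATCC"  # toy sequence, not Brucella data
    print(find_ssrs(toy))
    print(tract_density(toy))
```

Real SSR surveys usually apply motif-specific minimum lengths and may tolerate imperfect repeats, so counts from this sketch would not be directly comparable to the figures reported above.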


A24:
Cluster Analysis and Co-expression study of Mycobacterium tuberculosis for Genome Wide Microarray Expression Data

Subject: Evolution & Comparative Genomics

Presenting Author: Utkarsh Raj, Gautam Buddha University

Author(s):
Monika Kumari, Gautam Buddha University

Abstract:
Mycobacterium tuberculosis is one of the most common Gram-positive, highly aerobic bacterial pathogens in humans. It inhibits lung function and may result in tuberculous pleuritis, a condition that may cause symptoms such as chest pain, nonproductive cough and fever. Moreover, infection with M. tuberculosis can spread to other parts of the body, especially in patients with a weakened immune system. This condition is referred to as miliary tuberculosis, and people contracting it may experience fever, weight loss, weakness and a poor appetite. To date, the mechanism of pathogenesis in humans remains largely unknown. Microarray technology has been widely applied in biological and clinical studies for simultaneous monitoring of the expression of thousands of genes. Numerous microarray studies aimed at understanding the mechanism of pathogenesis of M. tuberculosis have been conducted. Gene clustering analysis is useful for discovering groups of correlated or coexpressed genes that are potentially co-regulated or associated with the disease or conditions under investigation. In the current study, we conducted a cluster analysis of M. tuberculosis to identify co-expressed genes, using raw microarray data sets available from the Stanford Microarray Database. There are different approaches to analysing large-scale gene expression data, the essence of which is to identify gene clusters. This approach has allowed us to determine expression profiles of novel developmentally regulated genes. Finally, we obtained a set of genes that are highly coexpressed and may be involved in pathogenesis. These genes are also novel developmentally regulated genes and could be used as drug targets.


A25:
The Event-Driven Method to Investigate Operon Evolution

Subject: Evolution & Comparative Genomics

Presenting Author: David Ream, Miami University

Abstract:
Operons are important bacterial genomic features whose evolution is poorly understood. Operons usually contain many related genes under a common regulatory control mechanism. Often, operonic genes are organized in the same order in which they catalyze the reactions in their metabolic pathway. These features not only allow operons to control large blocks of bacterial metabolic functions, but also give researchers a method to assign function to unknown genes.
Several models of operon evolution have been proposed, but none offer a universal method to investigate all types of operons. Our work aims to discover which models operate, and under what conditions they drive operon evolution. To make this determination, we developed a methodology which identifies the chromosomal events which change an operon's structure. We describe these as attributes, which allow us to quantitatively classify these changes and assign an associated cost. Each attribute relates to the amount of change from the reference operon, found in the gold-standard E. coli, to each target organism. The events we measure in the changing operon structure are their breaking up, group conservation, gene loss, gene duplication, and gene fusion. By correlating the attribute measures to evolutionary time we demonstrate trends in operon evolution relating to metabolic function. Surprisingly, we see unexpected trends in operon evolution that cannot be explained by phylogeny alone.


A26:
A Bioinformatic Approach to Identifying CRISPR-associated Immune Defense

Subject: Evolution & Comparative Genomics

Presenting Author: Zachary Romer, Loyola University Chicago

Author(s):
Catherine Putonti, Loyola University Chicago

Abstract:
Clustered regularly interspaced short palindromic repeats, or CRISPRs, have recently been identified in half of all bacterial and nearly all archaeal genomic sequences. CRISPRs provide the bacteria/archaea with “immunity” against viruses and plasmids, recognizing “foreign” DNAs which match spacer sequences located within the CRISPR loci. These 26- to 72-base-pair spacer sequences within the microbial genomes are identical to the corresponding phage or plasmid genomic sequence.

Identifying the source of spacer sequences within the genomes of bacteria and archaea can provide insight into the individual microbe’s resistance and prior exposure to particular bacteriophages and/or plasmids. Detecting CRISPR sequences has typically relied on the identification of the 21- to 48-base-pair direct DNA repeats which separate the spacers. While not originally designed for detecting CRISPR repeats, existing tools have been quite successful in identifying spacer sequences. The vast majority of these spacer sequences, however, do not BLAST to any known viral genome sequences.

Herein we present a new software tool which utilizes known virus genomes and our knowledge of the structure of CRISPR loci. Rather than target identification of the repetitive elements, our approach looks specifically for spacer sequences. This tool was developed with the specific purpose of identifying CRISPRs in unassembled metagenomic next-generation sequencing reads. Using this tool we examined both publicly available genomic sequences as well as metagenomic sequence collections.


A27:
Characterizing Indel Evolution in Bird Genomes

Subject: Evolution & Comparative Genomics

Presenting Author: Ted Vlahos, Loyola University Chicago

Author(s):
Kamil Slowikowski, Loyola University Chicago
Sushma Reddy, Loyola University Chicago

Abstract:
In this genomic era, there is an abundance of genetic data, yet use of this information to understand how genes, genomes, and organisms have changed over time is still in its infancy. Birds are the most diverse group of terrestrial vertebrates, and better knowledge of their genomes will help to understand how vertebrates have evolved. DNA insertion and deletion mutations, or indels, are significant contributors to the evolution of coding and non-coding DNA sequences. It has been shown that indel mutations are not randomly distributed throughout genomes; insertion and deletion frequencies are highly dependent upon sequence framework. The main objective of this project was to acquire a greater understanding of the mechanisms pertaining to the evolution of indel mutations in birds. Using comparative genomic analysis, we analyzed a data set containing approximately 50 kilobases of aligned nuclear DNA sequences from 19 independent loci for 200 species, representing the full extent of the diversity of modern birds. We examined the prevalence of indels of various sizes, the correlation of indels with base composition, and possible deletion biases across the genomic regions. We have additionally developed a user interface algorithm that produces these indel statistics and graphical images for any sequence file.
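The indel size statistics mentioned in this abstract could be tabulated along the following lines. This is a hedged Python sketch rather than the authors' tool: the function names, the "-" gap character, and the dictionary input format are assumptions for illustration.

```python
import re
from collections import Counter

def gap_run_lengths(aligned_seq, gap="-"):
    """Lengths of consecutive gap runs in one aligned sequence."""
    return [len(m.group(0)) for m in re.finditer(re.escape(gap) + "+", aligned_seq)]

def indel_size_spectrum(alignment):
    """Count gap runs by length across an alignment.

    `alignment` maps a taxon name to its aligned sequence (assumed format).
    Each gap run is counted once per sequence, so an indel shared by
    several taxa is counted several times.
    """
    spectrum = Counter()
    for seq in alignment.values():
        spectrum.update(gap_run_lengths(seq))
    return spectrum

if __name__ == "__main__":
    # Toy alignment with hypothetical taxon names.
    toy = {"taxon1": "AC--GT", "taxon2": "A---GT", "taxon3": "ACGTGT"}
    print(indel_size_spectrum(toy))
```

Distinguishing insertions from deletions, or counting each event once, would additionally require mapping gaps onto a phylogeny, which this sketch does not attempt.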


A28:
Predicting polar auxin transport regulation in Isoetes through conserved protein functions in disparate evolutionary lineages

Subject: Evolution & Comparative Genomics

Presenting Author: Tim Williams, Ohio University

Author(s):
Gar Rothwell, Oregon State University
Sarah Wyatt, Ohio University

Abstract:
The evolution of bipolar growth initiating in the embryo allowed two plant lineages to evolve large trees: arborescent lycophytes and seed plants, represented by the model species Isoetes and Arabidopsis, respectively. We sought to explore the relationship of polar auxin transport to the evolution of bipolar growth, but the model of polar auxin transport regulation with PIN proteins in Arabidopsis cannot explain polar auxin transport in arborescent lycophytes, which lack the plasma membrane-localized PINs novel to seed plants. A null hypothesis of conserved protein function and regulatory interactions was tested with predicted functional changes and with concurrency of protein presence in lineages of the plant phylogenetic tree. The new model of polar auxin transport in non-seed plants relies on PGPs for all polar auxin transport. In Arabidopsis, members of the PIN and PGP families have identical functional potential as directionally localized auxin transporters. Additionally, PGP endosomal recycling occurs via the same pathway as PIN recycling, allowing a similar regulation mechanism. The reversed auxin transport polarity between the leafy and rooting shoots of arborescent lycophytes is likely established by this unidentified, invertible PGP regulation. Further comparison between regulation of polar auxin transport and bipolar growth in Arabidopsis and Isoetes serves as a case study in parallel evolution of complex traits between disparate plant lineages.


A29:
Phylogeography of Brachypteryx montana populations in the Philippines

Subject: Evolution & Comparative Genomics

Presenting Author: Mark Wojdyla, Loyola University Chicago

Author(s):
Bushra Alam, Loyola University Chicago
Sushma Reddy, Loyola University Chicago

Abstract:
Title: Phylogeography of Brachypteryx montana, a widespread bird across the islands of the Philippines

The rich and unique biodiversity of the Philippine archipelago has been shaped by past climatic and geological conditions. We conducted a phylogeographic analysis of a widespread bird species, Brachypteryx montana, in order to examine its diversification across the islands of the Philippines. Genetic variation in these birds was examined by sequencing two mitochondrial genes, ND3 (351 bp) and ND2 (1047 bp), and one nuclear intron, AC01 (1012 bp), for more than 200 individuals across the different islands. We will compare the signal from mitochondrial genes, which evolve faster, to that from nuclear genes in tracking the divergences of different populations of this species. The results of our phylogenetic analysis were consistent with those found in other vertebrates, such that each island hosted unique clades and populations on geographically proximate islands were closely related. With this analysis, we reconstructed the evolutionary history of divergence in these tropical montane forest dwellers and contributed to the knowledge of past connections between Philippine islands.


A30:
Lineage-specific Expansion of Green Algal Gene Families Relevant to Lipid Metabolism

Subject: Evolution & Comparative Genomics

Presenting Author: Guangxi Wu, Michigan State University

Abstract:
Green algae are photosynthetic eukaryotes derived from the same ancestor as higher plants. Some of them are sources of petroleum deposits due to their high lipid content. Under stress conditions, green algae further increase their lipid content, and thus may be suitable for biofuel production. However, genes involved in lipid metabolism have not been studied extensively in a comparative genomics context to address how lipid metabolic mechanisms have evolved in this lineage. Here, 4,206 protein domain families in Chlamydomonas reinhardtii and eight other green algal species are analyzed with an emphasis on genes involved in lipid metabolism. For each family, a phylogenetic analysis was conducted and orthologous groups were identified to uncover their evolutionary history. For conserved families, the conservation across nine green algal species could suggest a fundamental function in lipid metabolism. For lineage-specifically expanded families, though gene duplication is an important way of generating raw genetic material for evolution, most duplications lead to the rapid elimination of one copy of the gene due to lack of functional importance. To assess this, the Ka/Ks test and available stress condition expression datasets in C. reinhardtii will be used to determine the selection pressure and expression pattern changes. The quality of current genome annotations is also assessed to quantify the impact it might have on our analysis. Lipid metabolism genes specifically expanded in each algal lineage, supported by evidence of functional importance, could be potential targets of genetic and biochemical studies to further examine the species-specific aspects of lipid metabolism in green algae.


A31:
Genome-wide analysis of DNA tandem repeats in 31 species of algae and plants

Subject: Evolution & Comparative Genomics

Presenting Author: Zhixin Zhao, Miami University

Author(s):
Chun Liang, Miami University

Abstract:
DNA Tandem Repeats (TRs), defined as at least two adjacent motifs, are widespread in eukaryotic genomes. TRs are extremely unstable in evolution, since their mutation rates are much higher than the genome average. Most mutations of TRs are caused by variation in repeat number rather than point mutation. It is believed that the distribution of TRs is non-random, and they are often located within genes and other regulatory regions. Taking advantage of 31 sequenced genomes in Phytozome v8.0 (http://www.phytozome.net/), we investigated the characteristics and distributions of TRs in green algae and land plants. Based on our data analyses, there is no correlation between TR density and genome size. Our results also show that TR distribution has a strong preference among different genic regions (e.g., 5'-UTR, CDS, intron and 3'-UTR). In green algae, intron regions have the highest TR density (~1.4 times the whole genome average) while 5'-UTRs have the lowest (~1/5 of the whole genome average). Land plants have the highest TR density (~2 times the whole genome average) in 5'-UTRs and the lowest density in CDS regions (~2/5 of the genome average). Our results also suggest that the dominant TR motifs in the same genic regions appear to be similar for species within the same group (i.e., green algae, monocots and dicots). This might be related to GC content, because green algae, monocots and dicots have distinct features in terms of GC content in the whole genome and the genic regions in our data analysis.


A32:
Factor Analysis for E2Fs in Myc Induced Breast Cancer

Subject: Gene Regulation & Transcriptomics

Presenting Author: Danielle Barnes, Michigan State University

Abstract:
The E2F family of transcription factors has well defined roles in mediating cell cycle progression and apoptosis. However, we have recently demonstrated that ablation of the E2Fs can differentially regulate Myc induced tumors, revealing underlying mechanisms specific to E2F1, E2F2, and E2F3. Specifically, the loss of transcription factor E2F1 in tumors is associated with an increased growth rate, a reduction in apoptosis and a decrease in tumor latency. Discovery of specific pathways associated with E2F1 activity is crucial for researchers, as breast cancer is well known for its heterogeneity. Microarray gene expression data have been collected for Myc induced tumors in wild type and E2F mutant strains, yielding over 22,000 gene expression values. A method is needed that reduces data dimensionality and identifies potential pathways regulated by the various E2F activator transcription factors. Currently, a signature for E2F1 exists, but it is composed of factors such as apoptosis, proliferation, and metastases. I have applied a method of Factor Analysis which aims to reduce the high dimensionality of the data and offers the potential for discovery of previously unknown factors within the signature for E2F1 activation. Using these factors in combination with the known signature, various components of E2F1 activation can be expressed in specific terms of function instead of general activation of E2F1. Activation of E2F1 components is then utilized, allowing future research to target specific functions of the E2F1 transcription factor in assessing potential tumors and treatment.


A33:
Transcriptomic profiling of leptin-A knockdown in early zebrafish (D. rerio) development

Subject: Gene Regulation & Transcriptomics

Presenting Author: Mark Dalman, University of Akron

Author(s):
Anthony Deeter, University of Akron
Zhong-Hui Duan, University of Akron

Abstract:
Leptin is a 16 kD circulating cytokine protein that is well known for its implication in human obesity and its significant impacts on development, immune function, and metabolism, amongst others, affirming its role as a pleiotropic hormone. Despite more than 30,000 articles on leptin to date, <1% have looked at non-mammalian vertebrates, with even fewer looking at early development using a transcriptomic approach. We have previously characterized the knockdown of leptin-A and its receptor in early zebrafish (D. rerio) development; however, very little is known about how leptin-A knockdown influences overall gene expression and, more importantly, those pathways involved in metabolism, innate immune function, and development. Zebrafish, specifically, have garnered much attention in transcriptomics due to a completed and well-annotated genome along with commercially available genechips. Our study is the first to use a recently annotated Affymetrix Gene 1.1 ST Array strip to test for differences at the transcriptomic level in leptin-A morphants during early development. Preliminary analysis of the data indicates that markers of aerobic metabolism are attenuated, along with lipolytic pathways responsible for the mobilization of the TAG-rich yolk sac. Furthermore, there was a decrease in overall signal transduction and vesicular trafficking, indicating a reduced efficiency in signaling and/or a reduction in energy utilization for peripheral signals. Subsequently, innate immunity parameters were significantly affected, suggesting that energy mobilization may still be occurring despite leptin null mutants being in a perceived state of starvation.


A34:
A Hidden Markov Model to Identify Combinatorial Epigenetic Regulation Patterns for Estrogen Receptor α Target Genes

Subject: Gene Regulation & Transcriptomics

Presenting Author: Russell Bonneville, The Ohio State University

Author(s):
Victor Jin, Ohio State University

Abstract:
Many studies have shown that epigenetic changes, such as altered DNA methylation and histone modifications, are linked to ERα-positive tumors and disease prognoses. Several recent studies have applied high-throughput technologies such as ChIP-seq and MBD-seq to interrogate the altered architectures of ERα regulation in tamoxifen resistant breast cancer cells. However, the details of combinatorial epigenetic regulation of ERα target genes in breast cancers with acquired tamoxifen resistance have not yet been fully examined. We developed a computational approach to identify and analyze epigenetic patterns associated with tamoxifen resistance in the MCF7-T cell line compared to the tamoxifen sensitive MCF7 cell line, and ultimately to understand the underlying mechanisms of epigenetic regulatory influence on resistance to tamoxifen treatment in breast cancer. In this study, we used ChIP-seq data for ERα, PolII, and three histone modifications, together with MBD-seq data for DNA methylation, in MCF7 and MCF7-T cells to train hidden Markov models (HMMs). We applied the Bayesian Information Criterion (BIC) to determine that a 20-state HMM was best, which was reduced to a 14-state HMM with BIC score 1.21296E7. We further identified four classes of biologically meaningful states in this breast cancer cell model system, and a set of ERα target genes under combinatorial epigenetic regulation. The correlated gene expression level and gene ontology (GO) analyses showed that different GO terms were enriched in tamoxifen resistant vs sensitive breast cancer cells. Our study illustrates the applicability of HMM-based analysis of genome-wide high-throughput genomic data to study epigenetic influence on E2/ERα regulation in breast cancer.
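The BIC-based choice of the number of HMM states described in this abstract can be sketched as follows. This is a minimal Python illustration, not the authors' code: it uses the common definition BIC = -2 ln L + k ln n, and the parameter count assumes one independent Bernoulli (present/absent) emission per mark per state, which the abstract does not specify. All function names are hypothetical.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def hmm_param_count(n_states, n_marks):
    """Free parameters of an HMM under the assumed emission model."""
    transitions = n_states * (n_states - 1)  # each row sums to 1
    initial = n_states - 1                   # start distribution sums to 1
    emissions = n_states * n_marks           # one Bernoulli rate per mark per state
    return transitions + initial + emissions

def best_by_bic(fits, n_obs, n_marks):
    """Pick the state count with the lowest BIC.

    `fits` maps number of states -> maximized log-likelihood.
    """
    return min(fits, key=lambda k: bic(fits[k], hmm_param_count(k, n_marks), n_obs))

if __name__ == "__main__":
    # Made-up log-likelihoods for illustration only; 6 marks corresponds to
    # ERα, PolII, three histone modifications, and DNA methylation.
    print(best_by_bic({2: -1000.0, 3: -900.0}, n_obs=100, n_marks=6))
```

The penalty term grows with the number of states, so a model with more states is preferred only when its gain in log-likelihood outweighs the added parameters.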


A35:
Similarities and Differences among Muscle Diseases based on Gene Co-Expression Network Alignment

Subject: Gene Regulation & Transcriptomics

Presenting Author: Sung-Min Kim, University of California, San Diego

Author(s):
Shakti Gupta, University of California, San Diego
Yu Wang, University of California, San Diego
Ashok Dinasarapu, University of California, San Diego
Fadi Towfic, Iowa State University
Shankar Subramaniam, University of California, San Diego

Abstract:
Various muscular dystrophies, motor neuron diseases, and inflammatory myopathies affect millions of individuals each year, causing degeneration of skeletal muscle and premature death. Bioinformatics and systems biology allow us a unique opportunity to investigate the gene structures of these diseases. Herein, we have utilized four families of skeletal muscle functional genes (mechanical, metabolic, etc.) to study six different muscle diseases. The skeletal muscle profile data was downloaded from Gene Expression Omnibus (GEO) for patients with six different muscle diseases, including Amyotrophic Lateral Sclerosis (ALS), and for normal subjects. We utilized a pairwise network alignment algorithm (BiNA) to compare all seven gene co-expression networks. The genes under five families of skeletal muscle function were used for local alignment. Based on the alignment scores, hierarchical clustering was used to visualize relationships across the networks representing the gene expression changes due to different muscle diseases. The clustering results indicated a significant number of differences in ALS vs. all other conditions. To further delineate molecular and biological mechanisms causative of and consequential to ALS, we performed t-tests between ALS and all other conditions (p-values<=0.001). The intersection of the t-test results produced the list of genes that are differently regulated only in ALS. We then generated transcription factor networks by mapping the transcription factors to target genes present in the ALS regulated gene list. The transcription factor network of ALS revealed that genes such as HNF4A, FOXA2, and NR3C1, and their associated targets, are heavily involved in these conditions.
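The intersection step described in this abstract, in which the ALS-specific list is the set of genes significant in every ALS-vs-condition comparison, can be sketched in Python as below. The function name, the input format, and the placeholder gene and condition labels are assumptions for illustration; the t-tests themselves are taken as already computed.

```python
def condition_specific_genes(significant_by_comparison):
    """Genes significant in every pairwise comparison.

    `significant_by_comparison` maps a comparison label (e.g. "ALS_vs_X")
    to the collection of genes passing the significance cutoff there.
    """
    gene_sets = [set(genes) for genes in significant_by_comparison.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)

if __name__ == "__main__":
    # Placeholder labels and gene identifiers, not results from the study.
    toy = {
        "ALS_vs_normal": ["gene1", "gene2", "gene3"],
        "ALS_vs_diseaseA": ["gene2", "gene3", "gene4"],
        "ALS_vs_diseaseB": ["gene3", "gene2"],
    }
    print(sorted(condition_specific_genes(toy)))
```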


A36:
Characteristics and Significance of Intergenic PolyA RNA Transcription in Arabidopsis thaliana

Subject: Gene Regulation & Transcriptomics

Presenting Author: Gaurav Moghe, Michigan State University

Author(s):
Melissa Lehti-Shiu, Michigan State University
Alex Seddon, Michigan State University
Shan Yin, Michigan State University
Yani Chen, Michigan State University
Federica Brandizzi, Michigan State University
Piyada Juntawong, University of California - Riverside
Julia Bailey-Serres, University of California - Riverside
Shin-Han Shiu, Michigan State University

Abstract:
The Arabidopsis thaliana genome has over 27,000 protein-coding genes and is the best-annotated plant genome. However, recent transcriptome sequencing suggests the presence of several novel intergenic polyA transcripts. It is not clear whether these transcripts can be translated and whether these novel transcripts represent functional genes. In this study, we first assessed the extent of intergenic polyA transcription using eight mRNA-Seq datasets and found that Intergenic Transcribed Fragments (ITFs), while ranging from hundreds to thousands across all datasets, occupy only a tenth of the intergenic space. We assessed the potential functionality of ITFs based on breadth and level of expression, association with the ribosomal machinery, and primary sequence conservation. Most ITFs were identified as short, lowly expressed, dataset-specific transcripts lying close to annotated genes. Through analyses of translatome and proteome datasets, we found that ~35% of ITFs were likely translated. However, ITFs closer to genes were significantly more likely to be ribosome-associated, suggesting that they may be part of annotated transcriptional units. The sequence-level conservation of ITFs was assessed based on comparison between A. thaliana and A. lyrata and between 80 accessions of A. thaliana. We found that only ~15% of the ITFs are under significant purifying selection. Overall, our comprehensive analyses of the A. thaliana polyA transcriptome reveal that, despite the prevalence of ITFs, most do not display evidence of purifying selection. Thus, we cannot rule out the possibility that they are products of spurious transcription. Nonetheless, these apparently neutrally evolving ITFs may underlie an important mechanism in creating evolutionary novelty.


A37:
Splice Variants Detection Using RNA-Seq Assembly and Digital Normalization

Subject: Gene Regulation & Transcriptomics

Presenting Author: Likit Preeyanon, Michigan State University

Author(s):
C. Titus Brown, Michigan State University
Hans Cheng, USDA, ADOL, ARS, Michigan State University
Jerry B. Dodgson, Michigan State University

Abstract:
Recently, RNA sequencing (RNAseq) from Next Generation Sequencing (NGS) technology has been successfully used to study alternative splicing in humans and mice. Methods used in these analyses rely solely on high quality gene models. Consequently, these methods are not suitable for other organisms lacking high quality gene annotations. To overcome this problem, other methods have been developed; for example, one method does not rely on an existing annotation but instead constructs the gene models from sequence reads that are mapped to the genome. However, the method is limited by using only sequence reads that are mapped to the genome and gene models are built based on a computational prediction.

We have developed a pipeline, based on an assembly approach, that builds gene models and identifies alternative isoforms from RNA-Seq data. We have been using this method to study alternative splicing in chicken lines 6 and 7, which are resistant and susceptible, respectively, to Marek’s disease (MD). The method identified many novel genes and isoforms that are not included in existing gene models. The pipeline does not rely on existing gene annotations; therefore, it can be applied to study alternative splicing in any organism. Moreover, we are developing a technique that intelligently reduces the volume of RNA-Seq data and the number of sequencing errors to facilitate the assembly process.


A38:
Alternative Splicing of protein coding and non coding genes in Chlamydomonas reinhardtii

Subject: Gene Regulation & Transcriptomics

Presenting Author: Praveen Raj Kumar, Miami University

Author(s):
Nicholas Uth, Miami University
Chun Liang, Miami University

Abstract:
Pre-mRNA splicing is one of the fundamental post-transcriptional processes in eukaryotic gene expression and regulation. Alternative Splicing (AS) occurs when different splice sites of pre-mRNA are processed to generate distinct transcript isoforms from the same genes, leading to diverse proteins with functional and/or structural differences. Pre-mRNA splicing is guided by the cis-regulatory sequence motifs embedded in the pre-mRNA; in mammals these are currently well characterized as the consensus splice sites, branch point signal and poly-pyrimidine tract. Focusing on Chlamydomonas reinhardtii, we evaluate AS events and their associated cis-regulatory signals in this green alga. Based on all available Sanger-based ESTs (338,243) and 454 cDNAs (7,007,189), we evaluated AS events using GMAP and PASA, and updated the AUGUSTUS gene annotation by consolidating it with the resulting PASA AS models. Among 16,237 AUGUSTUS multi-exon protein-coding genes, 52.2% are subject to AS, while 7.85% of PASA-deduced multi-exon non-coding genes (1,969) show AS evidence. As observed in other plants analyzed so far, we found that the dominant AS mode is intron retention in both protein-coding (44%) and non-coding genes (42.75%). In comparison with the constitutively spliced introns, we found that the retained introns tend to have weaker splice sites, with less abundant G triplets (intronic splicing enhancers) and C triplets. We also observed that most retained introns are less than 250 nt long and therefore have the potential to be spliced by the intron definition splicing mechanism. This suggests that short introns that are spliced by intron definition are more likely to be retained if they possess weak signals.


A39:
Genome-wide detection of mRNA splicing mutations using information theory-based binding site models

Subject: Gene Regulation & Transcriptomics

Presenting Author: Ben Shirley, University of Western Ontario

Author(s):
Eliseos J Mucaki, University of Western Ontario
Pelin Akan, Royal Institute of Technology
Peter K Rogan, University of Western Ontario

Abstract:
The interpretation of mutations in high-throughput genome-wide sequencing data has traditionally focused on amino acid coding prediction and analysis. However, gene mutations also affect transcriptional and post-transcriptional processes, including mRNA splicing mutations which are well documented in genetic diseases. We have previously demonstrated that information theory based methods are a sensitive and specific approach to detecting and quantifying the effects of splicing mutations. This paper describes software for genome-scale analysis of splicing mutations based on comparing individual information contents (Ri, in bits) of binding sites in reference and variant sequences. Relative affinities of these interactions are based on differences in their respective Ri values. A client-server architecture has been implemented for the CLC-Bio workbench. After information analysis on the CLC-Bio Genomics Server, results can be viewed either in Manhattan plot format, dynamically filtered tabular output, or as BED tracks on a genome browser. The tabular format provides additional context for potential mutations, sorting capability, and identifies common SNPs. The results accurately predict mutational effects on natural and cryptic mRNA splicing, and can distinguish null mutations from leaky mutations that reduce splice site use. The software analyzes 211,351 variants from the U2OS osteosarcoma cell line in ~14 hours on an I7-based server. Novel variants (after eliminating variants in dbSNP130) included 3 inactivating and 11 leaky natural splice site variants, and 55 cryptic splice sites exceeding the strength of the adjacent natural site. These are tractable numbers of potentially pathogenic variants suitable for further laboratory investigation of effects on splicing.
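
The comparison of individual information contents between a reference and a variant site can be illustrated with a minimal Python sketch. It assumes the commonly used form Ri = sum over positions of (2 + log2 f(base, position)) and omits the small-sample correction. The toy frequency matrix, the function names, and the frequency floor are hypothetical and are not taken from the software described in this abstract.

```python
import math

# Hypothetical position frequency matrix for a 4-nt site (one dict per position).
# Values are illustrative only and are not derived from real splice-site data.
TOY_FREQS = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.60, "C": 0.10, "G": 0.20, "T": 0.10},
]


def individual_information(site, freqs, floor=1e-3):
    """Return Ri in bits: the sum over positions of 2 + log2 f(base, position)."""
    if len(site) != len(freqs):
        raise ValueError("site length must match the matrix width")
    total = 0.0
    for base, column in zip(site.upper(), freqs):
        # Assumption: a small floor avoids log2(0) for bases never observed.
        total += 2.0 + math.log2(max(column.get(base, 0.0), floor))
    return total


def delta_ri(reference, variant, freqs):
    """Change in information content (variant minus reference), in bits."""
    return individual_information(variant, freqs) - individual_information(reference, freqs)


if __name__ == "__main__":
    ref_site, var_site = "AGTA", "AGCA"
    print("Ri(reference):", round(individual_information(ref_site, TOY_FREQS), 3))
    print("delta Ri:", round(delta_ri(ref_site, var_site, TOY_FREQS), 3))
```

A negative difference in this sketch indicates that the variant site carries less information than the reference site under the toy model.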


A40:
Cis-regulatory code of stress responsive gene expression in plants

Subject: Gene Regulation & Transcriptomics

Presenting Author: Shin-Han Shiu, Michigan State University

Author(s):
Alexander Seddon, Michigan State University
Cheng Zou, Chinese Academy of Agricultural Sciences

Abstract:
Environmental stress leads to significant changes in gene expression that are central to plant survival. Although there are well studied examples of a few plant cis-regulatory elements (CREs) that function in stress regulation, the plant stress cis-regulatory code, i.e., how CREs work independently and/or in concert to specify stress-responsive transcription, is mostly unknown. We identified a large number of putative CREs through analysis of the transcriptional response of Arabidopsis thaliana above-ground tissues to multiple stress conditions. Surprisingly, biotic and abiotic responses are mostly mediated by distinct pCRE superfamilies. In addition, using machine learning approaches, we uncovered cis-regulatory codes specifying how pCRE presence and absence, combinatorial relationships, location, and copy number can be used to predict stress-responsive expression. Using salt stress response as an example, we showed that cis-regulatory code based on above-ground tissue expression can be used to predict response in roots and, most importantly, in rice, a plant species that diverged from A. thaliana ~150 million years ago. The discovery of these cis-regulatory rules significantly advances our understanding of plant stress transcriptional response. In addition, our ability to apply cis-regulatory logic across tissue types and species highlights the robustness of the regulatory rules and their utility in translational research.


A41:
Identification of The Protein Network That Associates With Long Non-coding RNA

Subject: Gene Regulation & Transcriptomics

Presenting Author: BING ZHANG, Case Western Reserve University

Author(s):
Marzieh Ayati, Case Western Reserve University
Lalith Gunawardane, Case Western Reserve University
Mehmet Koyuturk, Case Western Reserve University
Saba Valadkhan, Case Western Reserve University

Abstract:
One of the most significant discoveries resulting from studies on the human genome has been the elucidation of the broad existence of non-protein coding RNAs. The results of the ENCODE project and other large scale transcriptome analyses suggest that while over 93% of the human genome is transcribed into RNA, ORFs and their associated UTRs occupy only 2% of the genome. It is estimated that large non-protein coding transcripts (lncRNAs) constitute a major portion of the information output of the human genome. However, many aspects of the biology and functional mechanisms of this novel class of cellular regulators are almost completely unknown. We have analyzed a ~2700 nucleotide long nuclear lncRNA that is involved in neuronal differentiation and the cellular stress resistance pathway. To characterize the exact mechanism behind the complex and multi-faceted function of this lncRNA, we developed a technique to purify the endogenous lncRNA together with its interacting proteins in vivo. Using mass spectrometry we have successfully identified a significant number of proteins that interact with this lncRNA and likely contribute to the formation of the functional RNPs. To determine whether the RNA serves as a scaffold for protein-protein interactions and to detect the protein complexes that interact with the RNA, we have analyzed the interacting proteins using the existing protein-protein interaction network. Our results indicate that the lncRNA may function as a signal hub that brings several functional protein units together, thus facilitating interactions that enable the RNA to perform complex cellular functions.


A42:
Mass Spectrometry Analysis of Recombinant Cetacean Leptin

Subject: Mass Spectrometry & Proteomics

Presenting Author: Hope Ball, The University of Akron

Abstract:
Mass spectrometry has been a vital tool in the identification of proteins and their various interactions, and in determining the existence and type of post-translational modifications. The application of liquid chromatography (LC) and tandem mass spectrometry (MS/MS), together LC-MS/MS, allows for determination and analysis of target peptides from complex mixtures of proteins and has previously been important in large scale analyses of yeast proteomes (2) and examinations of protein expression in sectioned mammalian tissues (3). However, to date, this method has never been applied to studies of leptin protein expression. Leptin, a 16 kDa peptide hormone encoded by the obese (ob) gene and secreted by adipose (fat) cells (1), is best known for its role in the regulation of energy stores and food intake, where the presence of the protein acts to decrease appetite and increase metabolic rate. Arctic-adapted cetaceans build large adipose stores (blubber), and maintenance of these stores poses questions about the physiological effects of leptin in these animals. Here, LC-MS/MS analyses characterized signal peptides from full-sequence recombinant cetacean leptin. Future work will apply these signal peptides to detect and characterize wild-type leptin proteins in sera samples of wild and captive cetaceans using LC-MS/MS. Comparisons of the resulting spectra will allow for the detection of leptin protein, as well as any post-translational modifications, in wild-type samples.


A43:
Progress and Opportunities in the HUPO Human Proteome Project (HPP)

Subject: Mass Spectrometry & Proteomics

Presenting Author: Gilbert Omenn, University of Michigan

Author(s):
William Hancock, Northeastern University
Michael Snyder, Stanford University

Abstract:
The Human Proteome Organization (HUPO) announced the global Human Proteome Project at the Sydney World Congress in September 2010 and launched the HPP at the Geneva World Congress in September 2011. The goal is to identify and characterize at least one protein product from each of the 20,300 protein-coding genes. The timing reflects dramatic advances in the resource pillars for the HPP—a wide array of mass spectrometry platforms; the antibody-based Human Protein; and ProteomeXchange to integrate proteomics knowledge-bases. The HPP is organized into two investigative arms: a chromosome-centric HPP, with consortia so far, and a complementary effort to facilitate extensive biology and disease driven proteomics projects. Progress will be illustrated with findings from the breast-cancer-driven Chromosome 17 project.


A44:
Microbiome Community Dynamics: Upstream Methodology and Translational Effects on Diversity

Subject: Metagenomics

Presenting Author: Wendy Demos, Medical College of Wisconsin

Abstract:
Microbiome studies aim to characterize the microbial communities found at different sites on the human body in order to advance our understanding of human-microbe relationships and the pathogenesis of human disease. These approaches are already providing clinical insight into a variety of disease states including cancer, oral health, otitis media, and Crohn’s disease. Additionally, they are being used to study the effect of drug treatments (such as antibiotic intervention) in altering microbiota composition, which will ultimately aid physicians in the treatment and management of such diseases.
Roche 454 pyrosequencing (and more recently other short read sequencing) of the 16S rDNA gene has become integral to this approach.
A variety of measures can be derived from these studies; alpha and beta biodiversity measurements are extracted from within-sample and between-sample analyses, respectively. Bacteria are grouped into operational taxonomic units for visualization of phylogenetic relationships and taxonomic composition, and quantitative and qualitative measures are calculated to determine community membership and structure.
At MCW we are utilizing the Qiime pipeline along with additional statistical analyses. In one study we are measuring enteric bacteria biodiversity from ileal biopsies in pediatric subjects with newly diagnosed Crohn’s disease in order to find associations with the disease phenotype. We hypothesised that issues related to sample collection and preparation might affect the downstream measurements of taxonomic diversity, relative abundance, and phylogenetics. We will present our analysis and findings, including data on sample preparation methodologies and their subsequent effect in the form of an apparent loss in sample diversity.
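
As an illustration of the alpha and beta diversity measures mentioned above, the following minimal Python sketch computes a Shannon index within a sample and a Bray-Curtis dissimilarity between two samples from per-OTU read counts. These two metrics are common choices but are an assumption here; the abstract does not state which metrics were used, and the function names and example counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Alpha diversity: Shannon index H' (natural log) from per-OTU read counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # OTUs with zero reads contribute nothing to the sum.
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(counts_a, counts_b):
    """Beta diversity: Bray-Curtis dissimilarity between two samples.

    Both lists must list the same OTUs in the same order.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("samples must cover the same OTUs")
    numerator = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    denominator = sum(counts_a) + sum(counts_b)
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical OTU count vectors for two samples.
    sample_1 = [120, 30, 0, 50]
    sample_2 = [10, 150, 20, 20]
    print("Shannon, sample 1:", shannon_index(sample_1))
    print("Shannon, sample 2:", shannon_index(sample_2))
    print("Bray-Curtis:", bray_curtis(sample_1, sample_2))
```

A sample dominated by a single OTU yields a Shannon index near zero, which is one way an apparent loss in diversity would show up in such measurements.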


A45:
Gene-Targeted Metagenome Assembly

Subject: Metagenomics

Presenting Author: Jordan Fish, Michigan State University

Author(s):
Qiong Wang, Michigan State University
C. Titus Brown, Michigan State University
Yanni Sun, Michigan State University
James Tiedje, Michigan State University
James Cole, Michigan State University

Abstract:
Very large metagenomes tax the abilities of current-generation short-read assemblers. In addition to space and time complexity issues, most assemblers are not designed to correctly treat reads from closely related populations of organisms. In addition, general assembly annotation pipelines may not be well tuned for analysis of specific important environmental genes. We are developing a gene-targeted approach for metagenome assembly. In this approach, information about specific genes is used to guide assembly, and gene annotation occurs concomitantly with assembly. This approach combines a space-efficient modified De Bruijn graph representation of the reads with a protein profile Hidden Markov Model for the gene(s) of interest. To limit the search, we use a heuristic to identify nucleotide kmers that translate to peptides found in a set of representatives of the target protein family. Contigs are assembled in both directions from these starting kmers by applying graph path-finding algorithms on the combined De Bruijn-HMM graph structure. Using this technique we have been able to extract complete nifH protein coding regions from a 50G Iowa prairie metagenome and buk (butyrate kinase) coding regions from a human gut metagenome. Future work will focus on improving search efficiency and separating sequencing artifacts from low-coverage rare populations.
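
The seeding and extension steps described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: an exact k-mer set stands in for the space-efficient De Bruijn graph, extension proceeds in one direction only, and the HMM-guided path search is replaced by a plain rule that extends while exactly one successor exists. All function names and the toy read are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def kmers(seq, k):
    """Yield every k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def find_seed_kmers(reads, proteins, k=9):
    """Return nucleotide k-mers whose translation occurs in a reference protein."""
    if k % 3:
        raise ValueError("k must be a multiple of 3")
    w = k // 3
    words = {p[i:i + w] for p in proteins for i in range(len(p) - w + 1)}
    return {km for read in reads for km in kmers(read, k) if translate(km) in words}


def extend_right(seed, kmer_set, max_steps=1000):
    """Extend a seed to the right while exactly one unvisited successor exists.

    Assumption: this unambiguous-path rule replaces the HMM-guided search.
    """
    contig, current, seen = seed, seed, {seed}
    for _ in range(max_steps):
        successors = [current[1:] + b for b in "ACGT"]
        successors = [s for s in successors if s in kmer_set and s not in seen]
        if len(successors) != 1:
            break
        current = successors[0]
        seen.add(current)
        contig += current[-1]
    return contig


if __name__ == "__main__":
    reads = ["ATGGCCAAGTTT"]            # toy read encoding the peptide MAKF
    seeds = find_seed_kmers(reads, ["MAKF"], k=9)
    all_kmers = {km for r in reads for km in kmers(r, 9)}
    for seed in sorted(seeds):
        print(seed, "->", extend_right(seed, all_kmers))
```

Restricting the search to seed k-mers that translate to known peptides is what limits the portion of the graph that has to be explored.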


A46:
Profile HMM-based Protein Domain Classification for Metagenomic Sequences

Subject: Metagenomics

Presenting Author: Yuan Zhang, Michigan State University

Author(s):
Yanni Sun, Michigan State University

Abstract:
Next generation sequencing technologies now enable us to directly sequence microbial genomes recovered from environmental samples. Nowadays the major challenges with metagenomics have shifted from generating to analyzing and annotating sequences. Protein domain classification is an important step in metagenomic annotation. It classifies putative protein sequences into annotated protein families and thus aids in functional analysis. However, there still exist several challenges which need to be addressed. First of all, metagenomic data sets are generally large, which requires very efficient algorithms for functional annotation. Secondly, when the reads are very short, existing tools have poor sensitivity for classification. Finally, some platforms such as 454 pyrosequencing have high error rates, which make existing tools such as BLASTP and HMMER unable to produce satisfactory performance. We proposed two tools: HMM-FRAME and MetaDomain. They are both based on profile Hidden Markov Models. HMM-FRAME can accurately predict and correct sequencing errors and thus can help to improve the performance of profile HMM-based tools such as HMMER. MetaDomain is used to align short reads to their native domain families with much higher sensitivity than current tools while maintaining a low false positive rate. It can also be used to evaluate expression levels of protein domains of interest in a given data set.


A47:
Meta-analysis of Variables Affecting Vaccine Protection Efficacy of Whole Organism Brucella Vaccine and Vaccine Candidates from Mouse Vaccine Protection Studies

Subject: other

Presenting Author: Omar Tibi, Eastern Michigan University

Author(s):
Yongqun He, University of Michigan
Thomas Todd, University of South Florida
Yu Lin, University of Michigan
Samantha Sayers, University of Michigan
Denise Bronner, University of Michigan
Zuoshuang Xiang, University of Michigan
Lesley Colby, University of Michigan

Abstract:
Vaccine protection studies have three distinct stages that cover the vaccination of the test animals, the pathogenic challenge of the test animals, and the examination of the vaccine efficacy after test animal sacrifice. Throughout these three stages, there are many variables that can affect the efficacy of the vaccine, such as the dosage of the experimental vaccine, how many injections of the vaccine were given, the pathogen strain of the vaccine, and the type of vaccine – whether it is a live attenuated vaccine or a killed vaccine. In this study, we used a meta-analysis to examine published literature (121 papers to be exact) with an aim to determine how 19 different variables affect the efficacy of Brucella live attenuated and killed vaccines in the mouse model. Due to many papers not having all 19 variables, 74 of the 121 papers retrieved were used. In total the data from 403 experimental groups of mice were extracted from the 74 papers and analyzed. Our ANOVA statistical study found that 9 variables are significant in affecting vaccine efficacy, including the vaccination route, vaccination dose, and the strain of the vaccine. Some other surprising findings included RB51 only inducing protection 80% of the time and that intranasal vaccinations were ineffective at inducing protection. Knowing the variables that affect vaccine efficacy in a significant way allows for the selection of the significant variables, greatly increasing the vaccine’s performance.


A48:
Changes in microbial community structure by three chemically defined polyphenols

Subject: other

Presenting Author: Michael Schmidt, Miami University

Author(s):
Javier Gonzalez, USDA-ARS
Jonathon Halvorson, USDA-ARS
Ann Hagerman, Miami University
Allison Kreinberg, Miami University

Abstract:
Tannins may affect soil microbial populations by serving as substrates for microbial growth or by inhibiting microbial activity through binding to proteins. In this study we examined how three model tannins were respired by soil microorganisms and how the model tannins affected the abundance of total and ammonia-oxidizing microorganisms in soil. At time points ranging from several hours to 14 days, subsamples were removed to measure population abundances using quantitative PCR (polymerase chain reaction). Total microbial populations were assessed using a sequence from the 16S rRNA gene, and ammonia-oxidizing species were measured using the gene encoding the alpha subunit of the enzyme ammonia monooxygenase (amoA) for ammonia-oxidizing bacteria (AOB) and ammonia-oxidizing archaea (AOA). Changes in physiology were determined by Community Level Physiology Profiling (CLPP) in soil that had previously been incubated with individual polyphenols. The lowest molecular weight compound (methyl gallate) was a better substrate for respiration than the higher MW polyphenols (epigallocatechin gallate and oenothein B). The PCR data showed that none of the polyphenols supported microbial growth. The polyphenols selected for different populations, as reflected by changes in the respiration of the CLPP substrates. In general, methyl gallate increased the respiration of the CLPP substrates, suggesting that methyl gallate selected for some populations, while the larger polyphenols decreased respiration, suggesting selection against populations. PCR analysis suggested that none of the polyphenols had a major impact on either AOB or AOA.


A49:
Recombination Events and Hotspots: Analysis of their relation to genetic events

Subject: Population Genomics

Presenting Author: Robert Shields, Case Western Reserve University

Author(s):
Sunah Song, Case Western Reserve University
Jie Zheng, Nanyang Technological University
Jing Li, Case Western Reserve University

Abstract:
The study of recombination events and rates in human populations can provide insight into genetic function. Characterizing these variations is an essential step toward understanding the meiotic recombination mechanism in humans. We have generated recombination events from a high-resolution (500k) SNP data set of family data. We compared our findings to previous studies and also to the HapMap2 dataset. We have found that higher-resolution data sets allow for much more precise calls of recombination events. We have shown that our results are consistent with previous work while providing more precision. Further, we have shown a significant overlap between deletion events from the 1000 Genomes Project and our female recombination events. This result opens questions regarding how these two types of events can affect one another. We also show that the male and female recombination events overlap three motifs in a significant manner. In particular, we found that both male and female recombination events were located on motif binding sites for PRDM9, SP1 and REST. Finally, we have reported a list of possible candidate trans-regulators of male and female recombination events.


A50:
Analyses of Cancer Driver Gene Signaling Pathway Networks Using Within-Species Network Alignments

Subject: Protein Interactions & Molecular Networks

Presenting Author: Gurkan Bebek, Case Western Reserve University

Author(s):
George Linderman, Case Western Reserve University
Mehmet Koyuturk, Case Western Reserve University
Mark Chance, Case Western Reserve University

Abstract:
Recent advances in “-omics” techniques have led to the discovery of cancer driver genes (CAN-genes) that partake in carcinogenesis when mutated. However, the majority of cancer patients do not actually exhibit mutations in all of these genes. Based on this observation, we hypothesize that it is the driver genes' synergistic activity that leads to carcinogenesis, rather than a mutation in a single CAN-gene. We also hypothesize that different combinations of mutations would lead to similar carcinogenic phenotypes. Hence, we have applied a network approach to investigate CAN-genes, in order to identify the relationships between these genes and reveal common patterns of functional relationships that underlie similar phenotypic outcomes.
We present a network-alignment based approach to understand the relationships between CAN-genes. Using colorectal cancer as a specific application, we utilize a novel within-species network alignment algorithm. Next, we perform hierarchical clustering of the alignment results to group CAN-genes by the similarity of their associated networks. Finally, we integrate these clusters with independently observed somatic mutations across 94 patients, and find that mutations of CAN-genes in highly similar subnetworks are generally mutually exclusive (Mantel test; p=0.4).
This result shows that throughout the carcinogenic process, where multiple mutations are observed, mutations in only one of the synergistically similar CAN-genes may be sufficient to advance tumor growth in colorectal cancer. This validates our framework as an effective network-based approach to understanding the relationships between CAN-genes. Using this framework, we will improve our understanding of the tumorigenesis timeline and further improve diagnostic approaches.


A51:
NodeFilter: A Cytoscape Plugin for Efficient Network Exploration

Subject: Protein Interactions & Molecular Networks

Presenting Author: Gang Su, University of Michigan

Author(s):
Manhong Dai, University of Michigan, Ann Arbor
Brian Athey, University of Michigan
Barbara Mirel, University of Michigan
Fan Meng, University of Michigan

Abstract:
Cytoscape is one of the most popular software tools for visual network exploration. With large and complex networks, a very frequent task is to investigate the neighborhood of some functional nodes and of nodes sharing common properties and connections. This requires convenient methods to traverse the network and hide irrelevant information from the user. Cytoscape provides search capabilities to identify nodes, but no easy way to modify the node view based on neighborhood proximity. Furthermore, external data (i.e., not part of the displayed Cytoscape network) describing node relationships are often needed to facilitate novel biological knowledge discovery. Some plugins for Cytoscape, such as MiMI, have been developed to address part of the need for using external information. We have developed the NodeFilter plugin specifically to 1) interactively assist the user in growing and shrinking a network from pivot nodes of interest, using local or remote data. The user can hide irrelevant nodes from the view to reduce information complexity and visually explore regions of interest. Nodes can be queried and shown or hidden based on their distance to the pivot nodes. 2) We incorporated an efficient path discovery function based on pairwise entity relationships to enrich the user network. We have consolidated data from various sources such as MiMI, Gene-GO, microarray experiments and Natural Language Processing (NLP). We believe that, together with other plugins such as GLay (for network partitioning) and GSearcher (for network attribute queries), NodeFilter will significantly enhance the network visual exploration workflow.
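
Showing or hiding nodes by their distance to pivot nodes amounts to a breadth-first search from the pivots. The Python sketch below illustrates the idea on a plain adjacency dictionary; it does not use the Cytoscape API, and the function names and the toy network are hypothetical rather than part of NodeFilter.

```python
from collections import deque


def nodes_within(adjacency, pivots, max_distance):
    """Return {node: hop distance} for nodes within max_distance of any pivot."""
    distances = {p: 0 for p in pivots if p in adjacency}
    queue = deque(distances)
    while queue:
        node = queue.popleft()
        if distances[node] == max_distance:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def hidden_nodes(adjacency, pivots, max_distance):
    """Nodes that would be hidden: everything outside the pivot neighborhood."""
    return set(adjacency) - set(nodes_within(adjacency, pivots, max_distance))


if __name__ == "__main__":
    # Hypothetical undirected network stored as an adjacency dictionary.
    network = {
        "A": ["B"],
        "B": ["A", "C"],
        "C": ["B", "D"],
        "D": ["C"],
        "E": [],
    }
    print(nodes_within(network, ["A"], 2))
    print(sorted(hidden_nodes(network, ["A"], 2)))
```

Growing or shrinking the view then corresponds to re-running the search with a larger or smaller radius around the same pivots.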


A52:
An interactome approach for genomic analysis

Subject: Protein Interactions & Molecular Networks

Presenting Author: Kaiyu Shen, Ohio University

Author(s):
Razvan Bunescu, Ohio University
Sarah Wyatt, Ohio University

Abstract:
Transcriptome data (microarray, RNA-seq, etc.) provide valuable genome-wide information. However, analyses of such data often identify only a few “statistically significant” genes out of tens of thousands, losing potentially valuable information. Thus, we used a systematic approach to build an interactome as an alternative way of analyzing the data. Here we used sample data on gravitropic signal transduction in Arabidopsis inflorescence stems. We first conducted a microarray experiment to profile this event. Then we collected a series of biological features representing different attributes of each gene. Known protein-protein interactions (PPIs) were extracted from five public databases. PubMed IDs related to each gene were retrieved, and a natural language processing method was applied to each sentence to assign weights to each gene pair, which could identify true-positive, meaningful co-occurring pairs. GO/AraCyc and protein domain information were also mapped to each gene. Finally, we extracted the expression profiles of each gene from the TAIR and NASC databases and calculated the pairwise correlations (absolute correlation and partial correlation). After preparing the data in the desired formats, we tried two different machine learning algorithms, label propagation and SVM, to study these features based on a benchmark interactome. The results showed that our approach improved prediction and the precision-recall ratio compared with using only the available protein-protein interaction information. This method can thus serve as a template for identifying the interactome of any species.
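
The pairwise correlation features mentioned above can be illustrated with a short Python sketch. It assumes that "absolute correlation" means the absolute value of the Pearson coefficient and that the partial correlation is first-order (controlling for a single third gene); both are assumptions, as are the function names and the toy expression profiles.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("constant profile has undefined correlation")
    return cov / math.sqrt(var_x * var_y)


def absolute_correlation(x, y):
    """Assumed meaning of 'absolute correlation': |Pearson r|."""
    return abs(pearson(x, y))


def partial_correlation(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("z is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denominator


if __name__ == "__main__":
    # Hypothetical expression profiles for three genes across five conditions.
    gene_a = [1.0, 2.1, 2.9, 4.2, 5.0]
    gene_b = [0.8, 1.9, 3.2, 3.9, 5.1]
    gene_c = [5.0, 3.0, 4.0, 1.0, 2.0]
    print("|r| a,b:", absolute_correlation(gene_a, gene_b))
    print("partial a,b | c:", partial_correlation(gene_a, gene_b, gene_c))
```

Each gene pair would contribute such values as features alongside the PPI, text-mining, and annotation evidence before a classifier is trained.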


A53:
Evaluating the effects of clustering methods in co-expression-based functional inference in Arabidopsis thaliana

Subject: Protein Interactions & Molecular Networks

Presenting Author: Sahra Uygun, Michigan State University

Abstract:
Gene co-expression analysis has been widely used for hypothesizing gene functional relations. This is because genes with similar expression patterns are more likely to have similar regulatory mechanisms and, therefore, similar functions. In the genome of the model plant Arabidopsis thaliana, there are still genes that have no experimentally verified functions. Thus, functional inference based on co-expression can be particularly useful for predicting unknown gene functions. In this study, our first goal is to assess how expression data clustering methods as well as input data may influence functional inference of A. thaliana genes based on co-expression. In addition, we would like to address the question of whether functions of genes in certain biological processes may be better predicted with the expression data. Different variables in clustering analysis are considered, including input expression data, clustering algorithm, distance measure, number of clusters, and functional classification datasets for evaluating the performance of co-expression-based functional predictions. Publicly available A. thaliana stress expression data are used in the analyses, along with the most commonly used partitioning and hierarchical clustering algorithms. During the meeting, we will present how different variable combinations for clustering influence functional prediction accuracy. In addition, we will report whether the prediction performance differs between genes in different functional categories. This work will be useful for evaluating whether expression data clustering methods can be optimized for functional prediction and for maximizing the benefits of obtaining biologically meaningful information from gene expression.


A54:
Improved Search Strategies for Fitting Rate Parameters to Viral Assembly Models

Subject: Protein Interactions & Molecular Networks

Presenting Author: Lu Xie, Carnegie Mellon University

Author(s):
Gregory Smith, Carnegie Mellon University
Xian Feng, Carnegie Mellon University
Russell Schwartz, Carnegie Mellon University

Abstract:
Viral capsid assembly has been studied by researchers across biology, computer science, physics and mathematics because of its value as a model of complex self-assembly in general as well as its medical importance. Theoretical and simulation studies have played a key role in this work. While such efforts have been valuable in analyzing spaces of theoretically possible pathways or assembly methods, they have so far been limited in their ability to make predictions about specific viruses because of the difficulty of determining the detailed interaction parameters needed to instantiate simulations or theoretical models. In prior work, we addressed this question by using data fitting methods to learn reaction rate parameters for individual protein-protein interactions from the results of light scattering experiments that track bulk assembly progress of in vitro capsid models. In the present work, we have sought to improve on these prior approaches through a variety of strategies for reducing the degrees of freedom of the parameter space in order to allow more precise fits. Specific search strategies include novel approaches for grouping parameters and simultaneous fitting of data from multiple experimental conditions to a single physical model. We demonstrate our methods on three capsid systems: human papillomavirus (HPV), hepatitis B virus (HBV), and cowpea chlorotic mottle virus (CCMV). The resulting fits suggest a diversity of assembly mechanisms.


A55:
Bacterial APOA-1 Purification for the Deliverance of Anticancer Drugs

Subject: Protein Interactions & Molecular Networks

Presenting Author: Thurman Young, Langston University

Abstract:
Background: Although chemotherapy regimens have proved effective in attacking cancer cells and tumors, side effects and drug resistance remain a major concern during cancer therapy. The use of reconstituted high density lipoprotein (rHDL) nanoparticles has been investigated as a drug delivery system, including the transporting of small interfering ribonucleic acid (siRNA). The use of rHDL nanoparticles has great potential, in this regard, due to their ability to specifically target cancer cells via the HDL (SR-B1) receptor.
Objective: The goal of these studies was to continue our previous work to further improve the purification protocol of apolipoprotein A-1 (ApoA-1), a major component of rHDL, and the preliminary characterization of the siRNA-carrying rHDL nanoparticles.
Methods: E. coli cells transfected with the apo A-I gene were grown at 37 °C until an optical density of 0.6 was reached. The cells were then induced with 0.5 mM isopropyl β-D-1-thiogalactopyranoside (IPTG) and centrifuged. Subsequently, the pellets were suspended in the lysate solution and loaded onto a Nickel-Sepharose column. Thereafter, rHDL nanoparticles using siRNA were prepared.
Results: 190 mg per liter of purified ApoA-1 was obtained. Particle measurements improved from 198 nm to an average of 90 nm, and the chemical composition of the particles is still being investigated.


A56:
SEA for detecting activated biological subpathways

Subject: Protein Interactions & Molecular Networks

Presenting Author: Thair Judeh, Wayne State University

Author(s):
Dongxiao Zhu, Wayne State University

Abstract:
Motivation: With the widespread adoption of microarrays and next generation sequencing technologies, the rate of data accumulation has far outpaced the rate of data analysis. In order to cope with the ever increasing amounts of data, many different frameworks have been proposed to assist researchers in extracting meaningful biological insights from their high-dimensional molecular profile data. Since most current frameworks focus on unstructured gene sets, novel frameworks that focus on pathway structures in addition to pathway gene sets are able to provide a more refined analysis of the underlying biological conditions at hand.

Results: Structure Enrichment Analysis (SEA) is a novel framework for detecting biologically relevant structured gene sets. Using a graph traversal algorithm developed in-house, SEA extracts root-to-leaf linear paths from the KEGG pathways, which we hypothesize accurately model signaling cascade transductions. SEA also adapts the Clique Percolation Method in order to extract nonlinear subpathways. To rank these subpathways, SEA uses the Bayesian Information Criterion, which allows it to focus on the activated subpathways within a single condition or the conserved subpathways across multiple relevant conditions. Users can then click on results in the SEA Graphical User Interface to view highlighted subpathways within their respective parent pathways. SEA thus complements efforts to extract meaningful biological insights from high-dimensional molecular profile data.
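
As an illustration of the root-to-leaf extraction step, the sketch below enumerates linear paths in a small directed graph with a depth-first traversal. It is not SEA's in-house algorithm, which the abstract does not describe in detail; the function name `root_to_leaf_paths` and the toy pathway with its node names are assumptions made for this example.

```python
from typing import Dict, List


def root_to_leaf_paths(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Enumerate every root-to-leaf path in a directed acyclic graph.

    `graph` maps each node to the list of its direct successors.
    Roots are nodes with no incoming edge; leaves have no successors.
    """
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)
    has_parent = {child for successors in graph.values() for child in successors}
    roots = sorted(nodes - has_parent)

    paths: List[List[str]] = []

    def walk(node: str, prefix: List[str]) -> None:
        path = prefix + [node]
        successors = graph.get(node, [])
        if not successors:          # leaf reached: record the linear path
            paths.append(path)
            return
        for nxt in successors:
            if nxt not in path:     # skip edges that would revisit a node
                walk(nxt, path)

    for root in roots:
        walk(root, [])
    return paths


# Toy signaling cascade (hypothetical node names, not taken from KEGG).
toy_pathway = {
    "receptor": ["kinaseA", "kinaseB"],
    "kinaseA": ["tf1"],
    "kinaseB": ["tf1", "tf2"],
}
paths = root_to_leaf_paths(toy_pathway)
```

Each returned list is one candidate linear subpathway; ranking them (for example with an information criterion, as the abstract mentions) would be a separate step that this sketch does not attempt.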


A57:
Mutual Information Analysis of Adenylate Kinase

Subject: Protein Structure & Function

Presenting Author: Nicholas Callahan, The Ohio State University

Author(s):
Deepa Perera, Muskingum College
Venuka Durani, The Ohio State University
Brandon Sullivan, The Ohio State University
Deepti Mathur, Cornell University
William Ray, The Ohio State University
Thomas Magliery, The Ohio State University
Jendy Weppler, Muskingum College

Abstract:
Adenylate kinase (ADK) catalyzes the conversion of adenosine triphosphate and monophosphate (ATP + AMP) into two adenosine diphosphates (ADP), with a significant impact on both energy homeostasis and intracellular signaling. Three domains comprise ADK: the core domain, the AMP binding domain, and the LID domain. It has long been known that the LID domain closes over the active site upon substrate binding and that this action is essential to catalytic activity. Studies on the LID domains of ADK from gram-positive and gram-negative bacteria show that the former incorporates a zinc-chelating motif, while the latter has in its place a network of hydrogen bonds and hydrophobic contacts. Four residues that form the gram-positive chelating motif were observed by Dr. Will Ray both to have a high statistical correlation to each other and to act as the centerpiece of a distinct correlation motif in the StickWRLD imaging modality. We further elaborated on the relationship of these four positions by mutual information analysis, which uses the difference between the predicted and observed entropy of combined sets to describe the information one position can relate about another. When the four chelating motif residues in B. subtilis ADK are mutated to E. coli identities, the enzyme becomes catalytically inactive. We demonstrate that, using mutual information analysis, we can identify two mutations that rescue catalytic activity without affecting protein stability or folding.
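
The entropy difference described above corresponds to the standard definition of mutual information, MI(X;Y) = H(X) + H(Y) - H(X,Y), where H(X) + H(Y) is the joint entropy predicted if two alignment positions were independent and H(X,Y) is the joint entropy actually observed. The following is a minimal sketch of that calculation for two alignment columns. The function names and the toy columns are assumptions for illustration and are not the authors' pipeline or data.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(symbols: Sequence) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    total = len(symbols)
    counts = Counter(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(col_x: str, col_y: str) -> float:
    """Mutual information between two alignment columns of equal length."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must cover the same sequences")
    predicted = entropy(col_x) + entropy(col_y)   # joint entropy if independent
    observed = entropy(list(zip(col_x, col_y)))   # joint entropy actually seen
    return predicted - observed


# Toy columns (made-up residues): in the first pair the second position
# tracks the first exactly; in the second pair the positions are unrelated.
mi_coupled = mutual_information("CCCCDDDD", "HHHHEEEE")
mi_independent = mutual_information("CCDDCCDD", "HHHHEEEE")
```

In this toy case the coupled pair carries one bit of mutual information and the unrelated pair carries none. Real alignments would also need handling of gaps and small-sample bias, which the sketch leaves out.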


A58:
The Subtle yet Significant Effects of Amino Acid Correlations on Protein Structure

Subject: Protein Structure & Function

Presenting Author: Venuka Durani, The Ohio State University

Author(s):
Brandon Sullivan, The Ohio State University
Thomas Magliery, The Ohio State University

Abstract:
Multiple sequence alignments (MSAs) of proteins have been analyzed at the levels of amino acid conservation and correlation to understand and engineer proteins. While sequence conservation has been used to predict important residues and stabilizing mutations, various studies have suggested that complementing the data with correlation analysis can show better results. A complete understanding of the information encoded at the levels of consensus and correlation has not been achieved yet, and experimental data for such studies on relatively large globular enzymes is scarce. In a previous study, two consensus variants of the protein triosephosphate isomerase (TIM) were made and although they possessed the basic TIM scaffold and function, they were not completely native-like. In this study, we have carried out extensive correlation analysis of the TIM MSA leading to the discovery of a co-evolving network of residues. Strongly correlated residues of the network had a phylogenetic bias and belonged to metazoan, eukaryotic or bacterial sub-networks. Three consensus variants of TIM were designed by taking sequences from each one of these sub-networks to capture the correlations corresponding to each of them. Kinetic and biophysical characterization of these variants showed that they were more native-like than previously constructed consensus variants while also being thermostabilized. We also found several broken correlations in the previously constructed consensus mutants that could explain their non-native behavior. This shows that while consensus information is enough to capture the overall fold of a protein, the fine-tuning of the native structure is achieved by correlations.


A59:
Biomolecular Motors and Switches: From Machines to Drugs

Subject: Protein Structure & Function

Presenting Author: Barry Grant, University of Michigan

Author(s):
Guido Scarabelli, University of Michigan

Abstract:
Molecular motors and switches lie at the heart of key biological processes, from the division and growth of cells to the muscular movement of organisms. Our approach to studying these fascinating nanomachines couples bioinformatics (to probe sequence-structure-function relationships); molecular dynamics (to investigate essential conformational changes); Brownian dynamics (for diffusional protein-protein encounters); and computer-aided drug design (for discovering novel therapeutics). I will describe two discoveries that exemplify the power of this approach. First, how it uncovered the importance of electrostatics in the motion of kinesin motors, and how this information enabled the rational design of mutant motors with tailored velocities. Second, how it revealed that the traditional “induced fit” view for activating conformational changes in molecular switches should be replaced by a “conformational selection” model, and how this framework led to the discovery of novel small molecule Ras inhibitors.

Images and animations related to this work can be found at: http://thegrantlab.org/


A60:
Comparative structure and molecular dynamics of TAR RNA-binding protein (TRBP); A key player in RNA interference pathway

Subject: Protein Structure & Function

Presenting Author: Munishikha Kalia, University of Luebeck

Author(s):
Sarah Willkomm, University of Luebeck
Jens-Christian Claussen, University of Luebeck
Tobias Restle, University of Luebeck

Abstract:
The TAR RNA-binding protein (TRBP) is one of the key players in the RNA interference (RNAi) pathway. The protein-protein interactions between TRBP and Dicer facilitate loading of small RNAs into the RISC complex. Although the details of this reaction are not fully understood, it is evident that at least one of the three TRBP double-stranded RNA binding domains (dsRBDs) plays a crucial role. The crystal structure of the first dsRBD and a solution structure of the second dsRBD of human TRBP reveal an α-β-β-β-α fold. Thus far, no structural information is available for the linker regions or for the third dsRBD, which likely plays a vital role in binding to Dicer. The present study focuses on the structure prediction of the full-length TRBP by comparative modelling. Low homology of available sequences was overcome by using multiple templates with sequence identities ranging from 21-34%. Tertiary structure models of TRBP were generated with Modeller9v10, and the model with the lowest Discrete Optimized Protein Energy (DOPE) and Molprobity scores was selected. PROCHECK analysis was run for structure assessment, and 98.4% of the residues were present in the allowed regions of the Ramachandran plot. A molecular dynamics simulation over a period of 75 nanoseconds reveals an initially exposed and later tethered conformation. The structural flexibility of the third dsRBD confirms its role in the interaction with Dicer. The current findings provide first insights into the functionally important TRBP/Dicer interaction and might help to unravel some of the mechanistic details of RNAi.


A61:
Comparative Analysis of the Adenylate-Forming Superfamily

Subject: Protein Structure & Function

Presenting Author: Dani Leatherby, Franciscan University of Steubenville

Author(s):
John Perozich, Franciscan University of Steubenville

Abstract:
A superfamily of adenylate-forming enzymes (LuxE) includes aryl- and acyl-CoA synthetases, the adenylation domain of non-ribosomal peptide synthetases and luciferases. These enzymes perform a variety of roles, including fatty acid metabolism, detoxification of halogenated aromatic xenobiotics, antibiotic synthesis and bioluminescence. This undergraduate research project sought to compare 261 amino acid sequences and their structural, functional, and phylogenetic similarities. Seven residues were fully conserved among these sequences. An invariant glutamate coordinates the Mg2+ ion, along with a well-conserved threonine or serine. An invariant aspartate coordinates the adenosine ribose. An invariant arginine participates in a salt bridge network and also assists in the coordination of the adenosine ribose. An invariant lysine interacts with the β-phosphate of ATP in the adenylate-forming conformation. A highly conserved P-loop forms the phosphate-binding site. The linker motif joins the N-terminal and C-terminal domains. The C-domain can move to form two different functional conformations, the adenylate-forming and the thioester-forming conformations. Pattern analysis identified the ten most well-conserved motifs in the superfamily. Phylogenetic analysis reveals distinct groupings for each family of enzymes, with luciferases sharing closest homology to long-chain fatty acyl-CoA synthetases.


A62:
Structure and Sequence analysis of AT1, AT2, and MAS in binding and activation by Angiotensin molecules.

Subject: Protein Structure & Function

Presenting Author: Jeremy Prokop, The University of Akron

Abstract:
The renin-angiotensin system is a component of diseases ranging from cardiovascular disease to cancer. Its pathways are targets for treatments including ACE inhibitors, renin inhibitors and AT1 blockers. However, very little is understood about the various G-protein coupled receptors that are activated by angiotensin peptides. This study addresses three known receptors of the pathway: AT1, AT2, and MAS. Combining biochemical and amino acid variation data with multiple species sequence alignments, structural models of each receptor, and docking site predictions allows for visualization of how angiotensin peptides may bind and activate the three receptors, addressing conserved and variant mechanisms. This study reveals that MAS differs in its binding of angiotensin peptides, favoring binding to Ang-(1-7) and not Ang II. MAS-related proteins shown to be activated by Ang peptides reveal possible amino acids that may contribute to homo- or heterodimer formation with other membrane-bound proteins. Finally, a new model of angiotensin binding to AT1 and AT2 is proposed that reconciles data from site-directed mutagenesis with data from photolabeling experiments, which have previously been considered conflicting. The model works through a conserved initial binding mode followed by propagation of amino acid 8 (Phe) of Ang II through conserved aromatic amino acids to the final photolabeled position relative to either AT1 or AT2. This study identifies numerous future experiments that would allow a clear understanding of the angiotensin peptide receptors at a molecular level, which could serve as a valuable tool in drug design and discovery.


A63:
Contact Geometry of Estrogen Receptor Dynamics

Subject: Protein Structure & Function

Presenting Author: Yosi Shibberu, Rose-Hulman Institute of Technology

Author(s):
Mark Brandt, Rose-Hulman Institute of Technology

Abstract:
The estrogen receptor is a biologically important protein with crucial roles in normal physiology and in the development and growth of a variety of cancers, with breast cancer being the most common. The estrogen receptor is also an interesting protein from a protein structure-function perspective, in that its function requires conformational changes mediated by the binding of physiological and pharmaceutical ligands to the ligand-binding domain (LBD) of the protein. Because the nature of the conformational changes for this protein is poorly understood, we have used the estrogen receptor LBD as a model system for assessing structural alterations obtained in molecular dynamics simulations. Molecular dynamics simulations were performed using the 1QKU estradiol-bound structure with estradiol removed, and a ligand-free homology model based on the 1LBD RXR-alpha structure. We use a measure we developed previously, mean-contact-deviation (MCD), to compare the dynamics of the ligand binding pocket of the 1QKU estradiol-bound structure (with estradiol removed) to the ligand binding pocket of the ligand-free homology model. MCD generalizes the widely used root-mean-squared-deviation (RMSD) measure from three dimensions to n dimensions, where n in the current study equals the number of atoms in the ligand binding pocket. Comparisons based on MCD indicate that the ligand binding pocket is much more dynamic in the 1QKU estradiol-bound structure (with estradiol removed) than in the ligand-free homology model. Comparisons based on RMSD do not indicate any significant differences.
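
For reference, the conventional three-dimensional RMSD that MCD is said to generalize can be sketched as below. The abstract does not give the formula for MCD, so it is not reproduced here. The function name `rmsd` and the toy coordinates are assumptions for illustration, and the sketch assumes the two coordinate sets are already superimposed and list the same atoms in the same order.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """Root-mean-squared deviation between two matched coordinate sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equally sized")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Two toy "frames" of a two-atom pocket (made-up coordinates, in angstroms):
# every atom is displaced by 2 along z.
frame_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame_b = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
pocket_rmsd = rmsd(frame_a, frame_b)
```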


A64:
Inhibition of Class D β-Lactamases by Carbapenems Depends on Substrate Conformation: Effects of Moieties Distal to the β-Lactam Ring

Subject: Protein Structure & Function

Presenting Author: Agnieszka Szarecka, Grand Valley State University

Author(s):
Kelsey Perry, Grand Valley State University
Troy Wymore, Pittsburgh Supercomputing Center

Abstract:
Bacterial resistance to β-lactam antibiotics is a serious clinical concern. A major resistance mechanism relies on four classes of periplasmic hydrolases, the β-lactamases. In classes A, C and D, hydrolysis of the β-lactam ring proceeds in two steps: acylation and deacylation of a conserved nucleophilic Ser. Class D has the most diverse catalytic profiles and harbors several members able to hydrolyze the newest line of β-lactams, the carbapenems. The challenge is to determine the factors governing the ability or inability of a given class D member to hydrolyze carbapenems. In this study, we focused on two class D enzymes, OXA-1 (inhibited by carbapenems via arrested deacylation) and OXA-24 (an active carbapenemase), and compared their different modes of carbapenem substrate binding. We employed electronic structure calculations to determine the conformational and tautomeric equilibria in the doripenem pyrroline ring depending on 1) the enzyme-specific binding mode, and 2) the rotational flexibility of two moieties that are characteristic of the carbapenem line: the 6’-hydroxy-ethyl and the thio-pyrrolidine. We found that Δ2 tautomers (observed in OXA-24) are consistently lower in energy, but the OXA-1 binding pocket forces conformational changes in the doripenem that facilitate the Δ2-to-Δ1 transition. The 6’-hydroxy-ethyl rotamer observed in OXA-24 also favors the Δ2 tautomer. Our results show that antibiotic hydrolysis, or inhibition, is affected by subtle conformational changes in the entire doripenem substrate that are induced by the protein binding surface.


A65:
From Molecular Phylogenetics to Microsecond Molecular Dynamics Simulations: The Effect of an Allosteric Disulfide Bond in a Diverse Clade of Class D Beta-Lactamases on the Stability of the Active Site

Subject: Protein Structure & Function

Presenting Author: Troy Wymore, Pittsburgh Supercomputing Center

Author(s):
Nikolay Simakov, Pittsburgh Supercomputing Center
Agnieszka Szarecka, Grand Valley State University

Abstract:
Bacterial resistance to antibiotics is often facilitated through the action of beta-lactamases and is a critically important health challenge. Thus, understanding the evolution of these enzymes with regard to the binding and hydrolysis of antibiotics can inform and guide the process of developing new generations of these drugs. Here, we present our bioinformatic analyses of class D beta-lactamases (oxacillinases), one of only two enzyme families that employ a carboxylated lysine as part of their catalytic machinery. A refined multiple sequence alignment of over 80 sequences was constructed using MEME and 3-dimensional structure as a guide. Phylogenetic analysis revealed several distinct groups of class D sequences, enabling the identification of group-specific residues by the Group Entropy program (shannon.psc.edu/harvest). Of particular interest is the discovery of two conserved cysteines within a large, very diverse group of class D sequences derived from α-, β-, and γ-proteobacteria. Crystal structures of OXA-1 (a representative member of the group) show the cysteine pair either in a reduced state (PDB entry 3SIG) or possibly in a mixture of reduced and oxidized states (PDB entry 1M6K). Long-timescale MD simulations of the solvated OXA-1 structure show a dramatic change in the active site of the reduced form, while in the oxidized form the active site remains stable over the trajectory. These results demonstrate that stabilization of this section of the enzyme offers some fitness advantage and suggest strategies for new antibacterial therapeutics.


A66:
Inference of recurrent 3D RNA motifs from sequence

Subject: Protein Structure & Function

Presenting Author: Craig Zirbel, Bowling Green State University

Author(s):
Anton Petrov, Bowling Green State University
James Roll, Bowling Green State University
Neocles Leontis, Bowling Green State University

Abstract:
Correct prediction of RNA structure from sequence is an unsolved problem in bioinformatics. An important sub-goal is the inference of the 3D structures of recurrent hairpin and internal loops. Such motifs can play architectural roles, serve to anchor RNA tertiary interactions, or provide binding sites for proteins. To establish the sequence variation of recurrent motifs, all hairpin and internal loops from a non-redundant (NR) set of RNA 3D structures are extracted and clustered into geometrically similar families. Probabilistic models for sequence variability are constructed for each motif using hybrid Stochastic Context-Free Grammar/Markov Random Field (SCFG/MRF) models, parameterized by all motif instances and by knowledge of substitution patterns for non-Watson-Crick basepairs. SCFG techniques can account for nested pairs and insertions, while MRF ideas can handle non-nested interactions, including base triples. Given the sequence of a hairpin or internal loop from a secondary structure as input, each SCFG/MRF model calculates the probability that that sequence variant would occur. If the score is in the same range as that of sequences known to form the 3D structure, we infer that the new sequence forms the same 3D structure. This approach correctly infers the 3D structures of nearly all structured internal loops when using sequences from 3D structures as input. Often, a single sequence is enough to correctly infer 3D structure. Probabilistic models for 3D motifs from structurally conserved regions of ribosomal RNA were validated by scoring sequence variants from multiple sequence alignments that are different from those used to construct the models.


A67:
Genome-wide Profiling of Transcription Factor NFAT5 Binding Sites in Response to Hypertonicity: A Systems Approach

Subject: Gene Regulation & Transcriptomics

Presenting Author: Taruna Singh, National Institutes of Health

Author(s):
Yuichiro Izumi, National Institutes of Health
Danni Yu, National Institutes of Health
Joan D Ferraris, National Institutes of Health
Maurice B. Burg, National Institutes of Health

Abstract:
Next-generation sequencing (NGS) technologies have unprecedented speed and provide a more cost-effective, comprehensive picture of the genome than previously established Sanger sequencing. In this study, we apply a systems approach using various NGS methodologies to identify genes targeted by the osmoprotective transcription factor Nuclear Factor of Activated T-Cells 5 (NFAT5, also known as TonEBP or OBP). NFAT5 activates its downstream genes in response to elevated NaCl or hypertonicity. Targets include aldose reductase (AR) and the sodium-chloride-betaine cotransporter (SLC6A12), each of which generates intracellular organic osmolytes that protect renal medullary cells from their normal milieu of high and variable NaCl. We used chromatin immunoprecipitation with high-throughput sequencing (ChIP-Seq) to identify potential target genes of NFAT5 in wildtype mouse embryonic fibroblast (MEF) cells at three time points (1 hr, 24 hr, and adapted) and at two osmolalities (300 mOsm, and 500 mOsm with added NaCl). NFAT5 null MEF cells, which express a truncated NFAT5 that lacks the DNA binding domain, served as a control. We also performed RNA-Seq, using the same cells and conditions, to measure changes in expression of direct and indirect targets of NFAT5. We compared our RNA-Seq data to our ChIP-Seq data, using a differential expression algorithm with fold-change and p-value <0.05 as quantifiers. We found that at least two hundred ChIP-identified genes were up-regulated in the hypertonic condition and that ten percent of these correlated with significant changes in RNA expression, including AR and SLC6A12. To overcome a limitation of ChIP-Seq and RNA-Seq technologies, we recently completed TSS-Seq on the same dataset to determine the transcription start sites of NFAT5 target genes.


A68:
Computational Systems Biology Analysis of Changes in Gene Regulation during Flavivirus Infection

Subject: Gene Regulation & Transcriptomics

Presenting Author: Minming Li, Purdue University

Abstract:
Diseases caused by flaviviruses kill thousands of people and threaten millions more worldwide each year. Despite widespread studies, the underlying regulatory mechanisms in flavivirus-infected host cells are poorly understood. Microarrays, a popular gene expression profiling technique, have been widely used to detect differentially expressed genes in the study of these diseases. Using computational systems biology methods, this study will focus on analyzing high-throughput data, including gene expression data and transcription factor binding data from published literature and curated databases. The goal of the study will be to discover a common differential expression profile in host cells during infection by the different flaviviruses, and to report the regulatory elements, including the differentially expressed genes and transcription factors. The study will provide a clear conclusion about which genes are differentially expressed, which pathways are enriched, and which transcription factors are conserved during infection by the representative flaviviruses.


A69:
The distribution and abundance of insertion sequences in Escherichia coli: a phylogenetic perspective

Subject: Evolution & Comparative Genomics

Presenting Author: Ethan Knapp, University of Akron

Abstract:
The activity of mobile genetic elements presents a potent mutagenic force that acts across all domains of life. As a result, mobile elements increase variation within populations that harbor them. Between populations, differences in the distribution of mobile elements may lead to divergence in evolutionary rates. This variation in tempo can have a powerful impact on the evolutionary trajectory of a species. Within bacterial lineages there are significant differences in mobile element constitution, both in element copy number and in the types of elements that are present. In addition, these factors can change fairly rapidly within strains of bacteria. Most studies of the distribution of mobile elements in bacteria have focused on whether a particular element is found and the frequency at which it appears. Although this approach may contribute to comparative analyses between species, little can be said about patterns of variation that are found within a species. To address this issue, the distributions of known insertion sequences (IS) were characterized among strains of E. coli recently derived from natural sources. Both the total copy number and the types of IS elements were then compared to the bacterial phylogeny to illustrate how IS elements are distributed across E. coli.


A70:
Elucidating Functions of Universal Stress Proteins in Brucellae

Subject: Protein Structure & Function

Presenting Author: Dominique McInnis, Jackson State University

Author(s):
Shaneka Simmons, Jackson State University
Natasha Amos, Jackson State University
Andreas Mbah, Jackson State University
Wellington Ayensu, Jackson State University
Raphael Isokpehi, Jackson State University

Abstract:
Brucella is a Gram-negative, non-motile, non-encapsulated, facultative intracellular coccobacillus, which causes brucellosis in humans and various animal species. Human brucellosis is a very common zoonotic disease worldwide, with more than 500,000 new cases reported annually. In the United States, only about 100 to 200 cases are reported annually, according to the Centers for Disease Control and Prevention (CDC). Genes encoding proteins with the universal stress protein (USP) domain are known to provide cells with the ability to respond to various environmental stresses such as nutrient starvation, high salinity, extreme temperatures, drought and exposure to toxic chemicals. We hypothesize that the universal stress proteins function in response and adaptation to the hostile intracellular environments encountered during host infections. A total of 125 Brucella genes encoding USPs were obtained from the Integrated Microbial Genomes system, with 30 genomes having 4 USP genes each. The genome of Brucella sp. 83/13 had 5 USP genes, and we observed a 147aa universal stress protein unique to this strain. The amino acid length ranged from 101aa to 281aa. Amino acid sequences with only one USP domain and those with two USP domains were observed. The ligand-binding residues for each of the Brucellae USPs were also predicted to gain insights into their regulation. In conclusion, we have prioritized the USP genes for further research, including understanding their structure and function.


A71:
Investigating DNA Structural Properties as Gene Regulatory Signals in Plasmodium

Subject: Gene Regulation & Transcriptomics

Presenting Author: Bryan Quach, Loyola University Chicago

Author(s):
Catherine Putonti, Loyola University Chicago

Abstract:
The genus Plasmodium contains the parasitic species responsible for causing over 200 million cases of malaria each year. Despite research efforts to elucidate the mechanisms of transcriptional regulation in Plasmodium species, few DNA regulatory elements in the parasite's genome have been characterized. Several algorithms exist for motif finding that implement methods such as alignment, position weight matrices, and Markov models. These methods work effectively for organisms with an AT genome content between 40% and 70%, but the ~80% AT-biased genome of Plasmodium makes current motif discovery algorithms ill-suited for analyzing Plasmodium. Because current computational methods cannot accurately identify regulatory motifs in Plasmodium, we are developing an algorithm to find these elements in promoter sequences of Plasmodium genes. Herein we discuss the challenges of motif detection in Plasmodium and the potential for structural properties such as DNA curvature and bendability to aid in the discovery of DNA regulatory elements. Knowledge of cis-regulatory elements in Plasmodium would contribute to a deeper understanding of the parasite's biology and could potentially lead to the discovery of new targets for controlling malaria.
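
To make the AT-content threshold concrete, the sketch below computes the AT fraction of a sequence and checks it against the 40% to 70% range quoted above. The function names and the example sequence are assumptions for illustration only; this is not part of the algorithm under development.

```python
def at_content(sequence: str) -> float:
    """Fraction of A/T bases among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return at / (at + gc)


def in_typical_range(sequence: str, low: float = 0.40, high: float = 0.70) -> bool:
    """True if the AT content lies in the range where, per the abstract,
    existing motif finders work effectively."""
    return low <= at_content(sequence) <= high


# Made-up promoter fragment with 8 A/T bases out of 10.
fragment = "ATATATATGC"
fragment_at = at_content(fragment)
fragment_ok = in_typical_range(fragment)
```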



Section B

B01:
Tbl2KnownGene: a parser for converting NCBI .tbl to UCSC knownGene.txt file

Subject: Algorithm Development and Machine Learning

Presenting Author: Yongsheng Bai, University of Michigan

Author(s):
Richard McEachin, University of Michigan

Abstract:
Exome sequencing technology is being employed to identify SNPs and/or INDELs in genetic disease research. The schema for UCSC Genes (knownGene.txt) has been widely adopted for use in both standard and custom downstream analysis tools/scripts. For many popular model organisms (e.g. Arabidopsis), sequence and annotation data tables (including “knownGene.txt”) have not yet been made available to the public. We present Tbl2KnownGene, a .tbl file parser that can process the contents of an NCBI .tbl file and produce a UCSC Known Genes annotation feature table. The .tbl file is a 5-column tab-delimited feature table containing location and key information for records (gene, CDS, mRNA). Our parser/algorithm first classifies records into “blocks”. Each block’s contents are processed separately. The algorithm designates the leftmost start coordinate (rightmost start coordinate for “-”) annotated for exons as the record start and the rightmost end coordinate (leftmost end coordinate for “-”) as the record end. Our algorithm concatenates all exon start locations for a transcript into a single comma-separated list, and likewise all exon ends into a comma-separated list, to comply with the UCSC knownGene schema format. The algorithm determines a gene’s strand by comparing the record’s start and end values. Since the UCSC knownGene.txt table always reports exon coordinates in low-to-high order, our algorithm reverses the order of the exon coordinates for genes coded on the negative strand. We have tested our algorithm with data sets from the Arabidopsis genome (TAIR10). Our parser is applicable to other organisms with similar .tbl annotations.
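The coordinate handling described above can be sketched in Python as follows. This is a minimal illustration and not the Tbl2KnownGene code: the function name, the input shape (a list of exon start/end pairs in .tbl order, where start greater than end signals the negative strand) and the returned field names are assumptions made for the example. Any conversion between coordinate conventions is deliberately left out, since the abstract does not describe it, and the exons are sorted rather than reversed, which gives the same result when the input is already ordered along the transcript.

```python
def exons_to_known_gene_fields(exons):
    """Summarize one transcript's exons as knownGene-style fields.

    exons: list of (start, end) integer pairs in .tbl order (hypothetical input shape).
    """
    if not exons:
        raise ValueError("at least one exon is required")

    # Strand is inferred by comparing start and end, as the abstract describes.
    first_start, first_end = exons[0]
    strand = "+" if first_start <= first_end else "-"

    # Normalize every exon to (low, high) and report them in low-to-high order.
    intervals = sorted((min(s, e), max(s, e)) for s, e in exons)

    return {
        "strand": strand,
        "txStart": min(lo for lo, _ in intervals),
        "txEnd": max(hi for _, hi in intervals),
        "exonCount": len(intervals),
        "exonStarts": ",".join(str(lo) for lo, _ in intervals),
        "exonEnds": ",".join(str(hi) for _, hi in intervals),
    }


if __name__ == "__main__":
    # Hypothetical negative-strand transcript with two exons.
    print(exons_to_known_gene_fields([(300, 250), (200, 100)]))
```

For the negative-strand example, the exon lists come out in ascending genomic order even though the .tbl records list them from high to low.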


B02:
Joint Analysis of Omic Data Using an Iterative Splitting Random Forest (SRF) Algorithm

Subject: Algorithm Development and Machine Learning

Presenting Author: Xiaowei Guan, Case Western Reserve University

Author(s):
Mark Chance, Case Western Reserve University
Jill Barnholtz-Sloan, Case Western Reserve University

Abstract:
Existing statistical approaches have known limitations when integrating high dimensional omic data simultaneously. To overcome these limitations, we developed a novel pipeline by embedding subtype discovery techniques and multivariate methods into the feature selection process. Here, we used iCluster, an algorithm that identifies latent subtypes by jointly clustering on matched gene expression and methylation omic datasets with Spearman correlation filtering. Once the latent subtypes were identified, we applied a modified statistical, non-parametric algorithm, Splitting Random Forest (SRF), to the first principal components as the summary values of the reduced joint omic datasets. The SRF algorithm incorporates a random splitting test-train technique into the standard RF algorithm, allowing for identification of a small set of genes that distinguish between cancer subtypes while preserving high prediction accuracy. To test this pipeline, we used publicly available glioblastoma (GBM) epigenomic and transcriptomic data from the Cancer Genome Atlas (TCGA). We showed that the SRF method outperformed standard statistical and bioinformatic methods for identification of latent subtypes and dominant genes that distinguish between those subtypes. Two optimal runs of SRF with 21 genes in each were identified with a prediction accuracy of 0.985 and 12 (57.1%) genes in common. Our results show that the combination of the iCluster and SRF algorithms is a powerful way to identify novel latent subtypes and dominant biomarkers of cancer. By taking into account the two types of omic datasets simultaneously, this pipeline offers novel insights into cellular mechanisms and personalized medicine for cancer.


B03:
CellOrganizer: Integrated tools for image-derived models

Subject: Algorithm Development and Machine Learning

Presenting Author: Ivan Cao-Berg, Carnegie Mellon University

Author(s):
Aabid Shariff, Carnegie Mellon University
Jieyue Li, Carnegie Mellon University
Devin Sullivan, Carnegie Mellon University
Tao Peng, Microsoft
Gustavo Rohde, Carnegie Mellon University
Robert Murphy, Carnegie Mellon University

Abstract:
The architecture of eukaryotic cells is extremely complex, with tens of thousands of different proteins and other macromolecules organizing themselves into many dynamic and variable subcellular structures and organelles. Microscopy, in combination with computational tools to analyze the resulting images, remains the most important tool for learning about cell organization. However, the ability to integrate information from diverse images into a cohesive, generative framework is largely missing. We have previously described separate research efforts towards creating generative models for cell and nuclear shape and the patterns of proteins found in vesicular organelles and microtubules. Here we present CellOrganizer, a system that integrates our previous approaches to carry out two main tasks: (1) learning models from images that are automated, generative, statistically accurate and compact, and (2) synthesizing new instances of images from the learned models. Current components of CellOrganizer can perform these tasks by learning conditional models of cell shape, nuclear shape, chromatin texture, vesicular organelle size, shape and position as well as microtubule distribution and generating instances from these models. These synthesized images can be output as idealized cells or as images such as might be acquired on a specific microscope. They can also be used for cell simulations. CellOrganizer provides a robust, open-source set of tools that allows a systematic approach to learning models from different microscope systems and different experimental conditions for use in combination with other systems biology software tools.


B04:
RNA Structure Miner: a CUDA-based high-throughput bioinformatics tool for mining RNA structure

Subject: Algorithm Development and Machine Learning

Presenting Author: Min Dong, Miami University

Author(s):
Guoli Ji, Xiamen University
Q. Quinn Li, Miami University
Chun Liang, Miami University

Abstract:
Whether functional RNA secondary structures around mRNA poly(A) sites assist cleavage and polyadenylation is still open to debate. So far, RNA secondary structure analysis could not be done in a high-throughput fashion because of software limitations. We have developed a C++ program, RNA Structure Miner, which takes RNA secondary structure predictions in dot-bracket format as input and detects common structural elements across a large number of sequences. Based on the dot-bracket format, our software identifies hairpin, bulge, interior, junction and their combined structures, and characterizes them in terms of the composition of structure elements and free energy level. Our program is based on Compute Unified Device Architecture (CUDA), which takes advantage of graphics processing units (GPUs) for fast and efficient parallel computation. Using more than 10,000 genomic sequences around poly(A) sites in human, with or without the canonical poly(A) signal AAUAAA, we studied the relationship between RNA structures and poly(A) site usage efficiency in the human genome. We found that a hairpin-like structure (i.e., a complex structure containing hairpin, interior loops and bulge loops) is common around the poly(A) sites. The free energy level of this hairpin-like structure is clearly higher than that of its flanking regions, which facilitates polyadenylation.
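To illustrate how structural elements can be read off a dot-bracket string, here is a minimal serial Python sketch that finds hairpin loops, defined here as a base pair enclosing only unpaired positions. It is not the RNA Structure Miner code, which is a C++/CUDA program; the function name and the example structure are hypothetical, and bulges, interior loops and junctions are not handled.

```python
def find_hairpin_loops(structure):
    """Return (open_index, close_index) for each base pair that closes a hairpin loop.

    structure: dot-bracket string using '(' and ')' for pairs and '.' for unpaired bases.
    """
    stack = []      # indices of '(' that are still waiting for their partner
    hairpins = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            open_pos = stack.pop()
            enclosed = structure[open_pos + 1:pos]
            # A hairpin loop encloses at least one base and only unpaired bases.
            if enclosed and set(enclosed) == {"."}:
                hairpins.append((open_pos, pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return hairpins


if __name__ == "__main__":
    # Hypothetical structure with two stem-loops.
    print(find_hairpin_loops("((..))..((...))"))
```

Each sequence is processed independently of the others, which is the kind of workload that lends itself to the GPU parallelism mentioned in the abstract.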


B05:
Active Feature Acquisition for Protein-Protein Interaction Prediction

Subject: Algorithm Development and Machine Learning

Presenting Author: Madhavi Ganapathiraju, University of Pittsburgh

Author(s):
Mohamed Thahir, University of Pittsburgh
Tarun Sharma, Carnegie Mellon University

Abstract:
Machine learning approaches to predict protein-protein interactions (PPIs) use biological features of proteins to classify whether a protein pair is interacting or not. However, such features are not known for most proteins. Carrying out wet-lab experiments to determine all such unavailable features (‘missing features’) is infeasible as each experiment requires human expertise, time, high-end equipment and other resources. We propose an active feature acquisition (AFA) strategy to guide which of these missing features should be obtained experimentally so as to improve classifier performance. The AFA strategy has not previously been used in the domain of PPI prediction. The only approach previously developed for other domains considers every possible combination of instance, feature and feature-value and computes which combination gives the best accuracy for that batch; this approach does not scale to the domain of PPI prediction. We present a heuristic method that does not require retraining to calculate the utility of acquiring a missing feature. It takes into account the change in belief of the classification model induced by the acquisition of the feature under consideration. Our method achieves the highest possible F-score with as few as 40% of the missing features acquired, in contrast to random selection of features for acquisition, and is computationally very efficient compared to previous AFA strategies. By analyzing the features acquired by the algorithm, we find that the biological process feature is more relevant than the molecular function feature, which in turn is more useful than the subcellular localization feature for predicting protein-protein interactions.


B06:
Identification and classification of conserved RNA structural motifs using a graph theoretical approach

Subject: Algorithm Development and Machine Learning

Presenting Author: Jiajie Huang, Purdue University

Author(s):
Kejie Li, Broad Institute

Abstract:
Originally known as a genetic information carrier, RNA also plays a critical role in multiple cellular processes including transcriptional and translational regulation. Known functional RNA classes include transfer RNA, ribosomal RNA, ribonuclease P RNA, small nucleolar RNA, small nuclear RNA, transfer-messenger RNA, and regulatory elements in untranslated regions of messenger RNA. However, the majority of functional RNA motifs are yet to be identified.
Compared to DNA and protein, whose conserved functional motifs can be identified based on underlying sequence similarity, RNA functional motifs lack a reliable signal at the sequence level. However, RNA sequences with similar functions have conserved secondary and higher-order structures. RNA topology, the global organization of local structural elements (stems, loops, pseudoknots, etc.), offers an approach for identifying unknown but conserved functional elements.
In this study, we have developed a graph theoretical approach that is able to identify a set of topological features in an RNA graph; this set of features defines a unique structural fingerprint of the RNA molecule. By comparison of RNA structural fingerprints, we can identify conserved structural motifs across RNAs. Such conservation may be indicative of as-yet unknown function. Our preliminary results on four known functional RNA classes exhibited successful identification of specific conserved structural motifs in each class. Further classification using this class-specific motif information reached an accuracy of over 90%. The identification of RNA with similar structural features is a step towards structure-based prediction of RNA function.


B07:
Mining PubChem for Factor XIa Inhibitors: The Signature Molecular Descriptor, Support Vector Machines, and Genetic Algorithms

Subject: Algorithm Development and Machine Learning

Presenting Author: Brent Hughes, University of Akron

Author(s):
Donald Visco, University of Akron
Zhong-Hui Duan, University of Akron

Abstract:
The PubChem database offers significant opportunities for data mining related to the biological interactions of small molecules. High-throughput screening (HTS) data available in PubChem offers an avenue for locating preexisting molecules likely to interact with a selected biological target. However, in order to mine the data available in PubChem, it is necessary to build classifiers capable of distinguishing the interactivity of arbitrary molecules with a chosen target. In the present study, we use freely available bioassay data from PubChem to train a classifier to distinguish whether a given molecule will inhibit coagulation Factor XIa. The method presented here brings together three ideas: the Signature molecular descriptor, support vector machines (SVM), and genetic algorithms. By performing feature selection using a custom genetic algorithm on SVM input vectors representing molecular signatures, we were able to create a trivially parallel SVM training regimen capable of rapidly producing classifiers of high (> 90%) accuracy for the stated problem. Furthermore, the method is sufficiently general to permit the construction of classifiers for a variety of assays involving different biological targets. Once crafted, these classifiers can be used to mine PubChem in search of preexisting potentially therapeutic compounds.


B08:
A Tool for Homopolymer and Poly(A) tail identification and cleaning in NGS Transcriptome Data

Subject: Algorithm Development and Machine Learning

Presenting Author: Abrudan Patricia, Miami University

Author(s):
Jamie Morton, Miami University
Chun Liang, Miami University
John Karro, Miami University

Abstract:
We present a new computational method for the identification and cleaning of poly(A) tails and other homopolymers from Next Generation Sequencing data for transcriptomes. We can learn much about genomic structure and function through de novo transcriptome assembly: the technique not only provides access to unsequenced genomes, but can also produce data useful for studying gene expression dynamics, the annotation of certain genomic features (e.g. new genes and gene splice sites) and the discovery of novel biological processes (e.g. trans-splicing and transcript fusion). But when using such sequences to reconstruct genomic sequences, we must first identify post-transcriptional insertions such as poly(A) tails and trim them off. However, because of sequencing errors, the accurate identification of homopolymers is a challenging bioinformatics task. Here we use a series of simple circular Hidden Markov Models, tailored to specific sequencing technologies including Sanger, 454 and Illumina, to identify and filter such homopolymer sequences. We have produced human-validated benchmark data sets as the basis for tool quality estimation and evaluation, covering Sanger, 454 and Illumina sequences, and using these benchmark data we show that our tool has nearly perfect sensitivity, identifies boundaries extremely accurately, and is fast enough to process large data sets in a reasonable amount of time.


B09:
A multi-sample approach to inferring robust tumor phylogenetic markers

Subject: Algorithm Development and Machine Learning

Presenting Author: Ayshwarya Subramanian, Carnegie Mellon University

Author(s):
Russell Schwartz, Carnegie Mellon University
Stanley Shackney, Intelligent Oncotherapeutics

Abstract:
Phylogenetic analysis of tumor genome data can provide a way to understand key steps in tumor evolution by delineating the common temporal sequences of aberrations that take place during tumorigenesis. The topology of the resulting phylogenies can help identify tumor subtypes and major progression pathways and suggest possible molecular mechanisms of action. The primary inputs to phylogenetic algorithms in such an approach are molecular markers of tumor progression identified from large-scale tumor genomic data. Thus, the accuracy and robustness of tumor phylogenies depend heavily on the quality of these representative molecular markers, which must accurately differentiate between distinct subtypes and progression pathways. Here, we present a novel multi-sample hidden Markov model (HMM) to derive such markers from array-based copy number variation data for building character-based tumor phylogenies. Our algorithm uses an HMM to classify regions of copy number data into normal or aberrant segments, in the process performing a joint segmentation and calling of the samples. Each copy number value is assumed to come from either a diploid or aberrant Gaussian distribution depending on whether the underlying state is called normal or aberrant. The HMM then seeks to explain each copy number probe by a vector of these binary states across samples so as to maximize a simple likelihood model. We demonstrate the performance of the method in comparison to other similar approaches on both simulated and real data and describe its utility in tumor phylogeny inference and other applications.
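The two-state idea (normal versus aberrant, each with a Gaussian emission) can be illustrated with a single-sample Viterbi decoder in Python. This is a deliberately simplified sketch and not the authors' multi-sample model: it handles one sample rather than a vector of states across samples, and the function names, emission means, standard deviation, transition probability and toy data are all assumptions chosen for the example.

```python
import math


def gaussian_log_pdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi_two_state(values, means=(2.0, 3.0), sd=0.3, stay_prob=0.95):
    """Most likely state path (0 = normal, 1 = aberrant) for one sample's probe values.

    means, sd and stay_prob are illustrative assumptions, not fitted parameters.
    """
    if not values:
        return []
    states = (0, 1)
    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)

    # Uniform prior over the state of the first probe.
    score = [math.log(0.5) + gaussian_log_pdf(values[0], means[s], sd) for s in states]
    backpointers = []

    for x in values[1:]:
        new_score = []
        pointers = []
        for s in states:
            candidates = [score[p] + (log_stay if p == s else log_switch) for p in states]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_score.append(candidates[best_prev] + gaussian_log_pdf(x, means[s], sd))
        backpointers.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Hypothetical probe values with a short gained segment in the middle.
    print(viterbi_two_state([2.0, 2.1, 1.9, 3.0, 3.1, 2.9, 2.0, 2.0]))
```

Extending this to the joint setting described in the abstract would replace the scalar state with a vector of binary states across samples, which this sketch does not attempt.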


B10:
A Web Interface to Search for Similar Temporal Gene Expression Profiles

Subject: Algorithm Development and Machine Learning

Presenting Author: Olvi Tole, Grand Valley State University

Author(s):
Guenter Tusch, Grand Valley State University

Abstract:
Recent use of microarray technology has led to highly complex datasets often addressing similar biological questions. A researcher interested in exploring how temporal patterns discovered in a reference study translate into similar studies can select interesting studies from a database, e.g., NCBI GEO.
For our exploratory approach to find similar patterns, we look, for instance, for a peak in the profile instead of correlating the entire profile. This can be accomplished by different statistical or logic-based techniques, e.g., knowledge-based temporal abstraction, where time-stamped data points are transformed into an interval-based representation, or several statistical approaches.
We implemented these ideas by creating a platform, SPOT, based on open-source software. It uses the R statistical package to convert time series into an interval-based representation. Knowledge representation standards (OWL, SWRL) using the Semantic Web tool Protégé-OWL connect the user through a web interface.
The project is hosted on a Fedora Linux account and utilizes a MySQL database. The user interface is developed in PHP, a scripting language that allows executing R scripts and queries on the MySQL database. The website currently supports the NCBI GEO database and uses a subset of GEOmetadb’s (Yuelin Zhu, 2008) database. The software can “learn” temporal patterns based on the user’s graphical input and logic-based time pattern (in SWRL and Protégé).
The web application supports multiple user accounts. It allows the user to interrupt the search process at any time and later return to the position where they left off.
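The idea of turning time-stamped data points into an interval-based representation and then looking for a peak can be sketched as follows. SPOT itself uses R together with OWL/SWRL rules; this Python version is only an illustration, and the function names, the trend labels and the tolerance parameter are assumptions made for the example.

```python
def to_trend_intervals(times, values, tolerance=0.0):
    """Abstract a time series into (start_time, end_time, label) trend intervals.

    Consecutive steps with the same trend are merged into one interval.
    """
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    intervals = []
    for i in range(1, len(values)):
        delta = values[i] - values[i - 1]
        if delta > tolerance:
            label = "increasing"
        elif delta < -tolerance:
            label = "decreasing"
        else:
            label = "steady"
        if intervals and intervals[-1][2] == label:
            # Extend the previous interval instead of starting a new one.
            intervals[-1] = (intervals[-1][0], times[i], label)
        else:
            intervals.append((times[i - 1], times[i], label))
    return intervals


def has_peak(intervals):
    """True if an increasing interval is immediately followed by a decreasing one."""
    labels = [label for _, _, label in intervals]
    return any(a == "increasing" and b == "decreasing" for a, b in zip(labels, labels[1:]))


if __name__ == "__main__":
    # Hypothetical expression profile sampled at 0, 1, 2, 4 and 8 hours.
    profile = to_trend_intervals([0, 1, 2, 4, 8], [1.0, 2.0, 5.0, 3.0, 1.0])
    print(profile, has_peak(profile))
```

Matching on the abstracted intervals rather than on the raw values is what allows two profiles with different sampling times to be compared by shape.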


B11:
A probabilistic approach for efficient counting of k-mers

Subject: Algorithm Development and Machine Learning

Presenting Author: Qingpeng Zhang, Michigan State University

Author(s):
Jason Pell, Michigan State University
Rose Canino-Koning, Michigan State University
Adina Howe, Michigan State University
Charles Titus, Michigan State University

Abstract:
K-mer counting is widely used in many bioinformatics problems, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. However, currently available tools cannot efficiently handle the high-throughput data generated by next-generation sequencing technology because of high memory requirements or impractically long running times. Here we present the khmer software package for fast and memory-efficient counting of k-mers. Unlike previous methods based on data structures such as hash tables, suffix arrays, and tries, khmer uses a simple probabilistic data structure, the Bloom counting hash, which is similar in concept to the Bloom filter. It is highly scalable, effective and efficient in applications that use k-mer counting to analyze large NGS datasets, at the cost of a certain false positive rate. We also show applications of khmer to problems such as abundance filtering of reads for de Bruijn graph-based assembly and coverage estimation of sequencing effort. The results show that khmer achieves comparable results with acceptable accuracy while using much less memory and running faster. Some analyses of datasets too big to be handled by other available software become practical with khmer.
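A counting structure in the spirit of a Bloom filter can be sketched in a few lines of Python. This is not the khmer implementation or its API: the class name, the table sizes and the use of SHA-256 as the hash are assumptions made for the example. Each k-mer increments one slot in each of several tables, and the reported count is the minimum over those slots, so collisions can inflate a count but never reduce it.

```python
import hashlib


class CountingBloom:
    """Approximate k-mer counter using several fixed-size tables (illustrative only)."""

    def __init__(self, table_sizes=(1009, 1013, 1019)):
        # Different prime sizes make it unlikely that two k-mers collide in every table.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        digest = hashlib.sha256(kmer.encode("ascii")).digest()
        value = int.from_bytes(digest[:8], "big")
        return [value % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def count(self, kmer):
        # The minimum across tables is an upper bound on the true count.
        return min(table[slot] for table, slot in zip(self.tables, self._slots(kmer)))


def count_kmers(sequence, k, counter):
    """Add every k-mer of the sequence to the counter."""
    for i in range(len(sequence) - k + 1):
        counter.add(sequence[i:i + k])


if __name__ == "__main__":
    counter = CountingBloom()
    count_kmers("ACGTACGT", 4, counter)
    print(counter.count("ACGT"))
```

Memory use is fixed by the table sizes rather than by the number of distinct k-mers, which is the tradeoff against exact hash tables that the abstract describes.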


B12:
Adaptive Acquisition of Spatiotemporal Models of Protein Patterns  

Subject: Bioimaging

Presenting Author: Gregory Johnson, Carnegie Mellon University

Author(s):
Robert Murphy, Carnegie Mellon University
Devin Sullivan, Carnegie Mellon University

Abstract:
Characterizing how cells respond to perturbation is a fundamental task of cell biology, and is critical to the use of image cytometry and high content screening in drug development. Most of those efforts, however, focus on measuring the change in static behavior after a fixed time of exposure. We focus here on automated, efficient methods for measuring changes in dynamic cell behavior as a function of time of exposure. We build on our previous work on efficient building of models of cellular dynamics by implementing the approach on a real microscope and extending it to perturbation kinetics.


B13:
Web-based 3D-display of brain structures for data exploration

Subject: Bioimaging

Presenting Author: Gang Yang, University of Michigan

Author(s):
Fan Meng, University of Michigan

Abstract:
The complexity of the structure and function of the brain makes text-based or even 2D-graphics-based exploration of high-throughput data sets and literature related to the brain ineffective. We hope to build a web-based system to facilitate the understanding of brain structures, circuits, gene expression levels and disease. We chose the Adobe Flash Platform for 3D development because of its cross-browser support and because the newly released Adobe Flash Player 11 and AIR 3 now support the Stage3D API, a set of low-level GPU-accelerated APIs enabling advanced 2D and 3D capabilities. In comparison, while HTML5 is gaining popularity, its features are not fully supported by all browsers and it still does not provide any framework for rapid development.
We build 3D brain structure models using voxel-level data provided by the Allen Brain Institute. The 3D models are converted into a format that the Flash Stage3D API can use and are saved on a BlazeDS server. When requested by users, 3D models are loaded from the server into the Flash player in the user’s browser. Users can select 3D structures to display together with the corresponding 2D view in the x, y, or z direction. We believe this is the first implementation of interactive 3D biological structure display in Flash, and we plan to integrate this implementation with gene expression, function annotation and literature data to help the understanding of high-throughput neurobiology data in the context of brain structure and function.


B14:
Towards large scale automated interpretation of cytogenetic biodosimetry data

Subject: Bioimaging

Presenting Author: Yanxin Li, Computer Science, University of Western Ontario

Author(s):
Asanka Wickramasinghe, University of Western Ontario
Akila Subasinghe, University of Western Ontario
Jagath Samarabandu, University of Western Ontario
Joan Knoll, University of Western Ontario
Peter Rogan, University of Western Ontario

Abstract:
Cytogenetic biodosimetry is the definitive test for assessing exposure to ionizing radiation. It involves manual assessment of the frequency of dicentric chromosomes (DCs) on a microscope slide, which potentially contains hundreds of metaphase cells. We developed an algorithm that can automatically and accurately locate centromeres in DAPI-stained metaphase chromosomes and will detect DCs (Proc. ICIP 2010, pp. 3613-3616). In this algorithm, a set of 200 metaphase cell images is ranked and sorted. The 50 top-ranked images are used in the triage DC assay. After Gradient Vector Flow (GVF) segmentation, the centerline of each chromosome is derived by medial axis transformation (MAT) and pruned by Discrete Curve Evolution (DCE). The algorithm locates the centromeres at the joint minima of the width and intensity profiles along the centerline. For DCs, our strategy is to detect the first centromere, mask the corresponding pixels, and repeat the procedure to identify the second centromere. To meet the requirements of the dicentric chromosome assay (DCA) in a mass casualty event, we are accelerating our algorithm through parallelization. Data-parallelized ranking and GVF codes were tested on a 4-core processor. Whereas serial ranking of 200 images requires ~10.7 sec, MPI-based parallelization completes in 2.7 sec. Serial GVF segmentation requires 5 sec to process 35 metaphase images, compared to parallelized GVF, which completes in 1.3 sec. Overall, we estimate that the automated DCA will require 2.5 min per sample. Our long-term goal is to implement these algorithms on a high performance computer cluster to assess radiation exposures for thousands of individuals in a few hours.
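The timings quoted above can be turned into speedup and parallel efficiency figures with a short Python calculation. The function names are hypothetical; the inputs are the serial and parallel times and the core count reported in the abstract.

```python
def speedup(serial_seconds, parallel_seconds):
    """Ratio of serial to parallel run time."""
    return serial_seconds / parallel_seconds


def parallel_efficiency(serial_seconds, parallel_seconds, cores):
    """Speedup divided by the number of cores (1.0 would be ideal scaling)."""
    return speedup(serial_seconds, parallel_seconds) / cores


if __name__ == "__main__":
    cores = 4
    # (task, serial time in seconds, parallel time in seconds) as reported in the abstract.
    timings = [("ranking, 200 images", 10.7, 2.7), ("GVF, 35 images", 5.0, 1.3)]
    for task, serial, parallel in timings:
        print(task,
              round(speedup(serial, parallel), 2),
              round(parallel_efficiency(serial, parallel, cores), 2))
```

Both steps come out close to a 4x speedup on the 4-core processor, that is, near-linear scaling for these data-parallel stages.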


B15:
Predicting and Testing Cell Signaling Pathway Requirements in Breast Cancer

Subject: Disease Models & Epidemiology

Presenting Author: Eran Andrechek, Michigan State University

Author(s):
Daniel Hollern, Michigan State University
Inez Yuwanita, Michigan State University
Danille Barnes, Michigan State University

Abstract:
Breast cancer is a heterogeneous disease with key differences apparent in the morphology and gene expression patterns inherent within and between individual breast cancers. This heterogeneity presents a critical obstacle to successful treatment of the disease. To better understand the heterogeneity in activation of signaling pathways in breast cancer, we have employed training data to generate pathway signatures using a Bayesian Factor Regression Modeling (BFRM) approach. Used in conjunction with a database we assembled of a large number of mouse model breast cancers with unique initiating oncogenic events, we have predicted roles for a number of key cell signaling pathways in specific tumor types. For instance, we have predicted a role for the E2F transcription factors in mouse models overexpressing Myc, Neu(HER2), and PyMT. Subsequent tests using Geneset Enrichment Analysis (GSEA) have confirmed our predictions. However, to directly test these bioinformatic predictions, we have integrated our genomic approach with traditional tests using mouse model systems. By interbreeding the models overexpressing Myc, Neu(HER2) and PyMT with knockout mice for the individual E2F transcription factors we have directly tested our bioinformatic predictions. This has revealed that the E2F transcription factors play key roles in tumors initiated by Myc, Neu(HER2) and PyMT with effects on tumor latency, growth rate, apoptosis and metastasis that are unique to each of the initiating oncogenes. Gene expression from the resulting tumors is being analyzed to determine how E2F targets are differentially regulated and cause these differences.


B16:
Phenome-Genome Pathway Networks Analysis: Implicating a Systems Biology Basis for Adverse Interactions of anti-TNF agents and Corticosteroids

Subject: Disease Models & Epidemiology

Presenting Author: Mayur Sarangdhar, Cincinnati Children's Hospital Medical Center

Author(s):
Jeanine Dahlquist, Cincinnati Children's Hospital Medical Center
Bruce Aronow, Cincinnati Children's Hospital Medical Center

Abstract:
OBJECTIVES:
To use integrative analysis for improved understanding of the underlying molecular mechanisms of specifically correlated Adverse Events (AE) exhibited by patient subgroups treated with TNF inhibitors, methotrexate and corticosteroids.
METHODS:
Using the FDA's manually reviewed Adverse Event Reporting System (AERS), we compared the differential rates of adverse events among TNF-α inhibitory drugs and their relationship to methotrexate and corticosteroids, which are concomitantly used in the treatment of autoimmune/inflammatory diseases. Treatment cohorts included TNF inhibitors (infliximab, etanercept, adalimumab, certolizumab pegol, golimumab), corticosteroids and methotrexate in monotherapy and concomitant use. Patient records with no indication of autoimmune/inflammatory diseases formed the control group and those with any cancer indication were excluded from the study.
RESULTS:
The exposed group of 195011 (6.71%) from 3 million records showed a substantially higher number of AEs for anti-TNF medications (48.45%) compared with methotrexate (1.81%) or corticosteroid monotherapy (1.21%). Using anti-TNFs with concomitant corticosteroids dramatically increased occurrence rates of a set of AEs that included interstitial lung disease (ILD), pleural effusions, hypoxia, pulmonary edema, septic shock, sepsis and respiratory failure. For instance, elderly females show an increase in ILD occurrence from 6.26 reports per 1000 (anti-TNF monotherapy) to 25 (anti-TNF+corticosteroids), and the rate is even higher, 39, for anti-TNF+corticosteroids+methotrexate. A similar increase in occurrence is observed for pulmonary edema (2.818 – 9.459 – 10.07).
CONCLUSIONS:
Edema/pulmonary interstitial fibrosis networks were deeply associated with TLR/TNF-α-associated signaling pathways and extensively intersected with glucocorticoid-affected pathways. These results suggest a variety of approaches to modifying therapeutic strategies and recognizing individuals at greatest risk of developing fatal complications from concomitant medications.


B17:
The role of histamine in the regulation of carcinogenesis and anti-tumor immunity

Subject: Disease Models & Epidemiology

Presenting Author: Andras Falus, Semmelweis University

Abstract:
Introduction
The availability of systems and computational biology technologies provides new ways of understanding the paracrine and autocrine role of histamine in tumor growth. Moreover, tumor cells modify local histamine synthesis.
Objectives
The lecture attempts to summarize our research data on the role of histamine in generation, maintenance and spreading of experimental tumors.
Materials and methods
Transgenic mice with targeted histidine decarboxylase, in vitro transfection technologies, histology and methylation assays are applied.
Results
1. The role of histamine is demonstrated in tumor growth using mouse melanoma cells manipulated via stable transfection with sense and antisense mouse histidine decarboxylase (HDC) mRNA as well as mock construct, respectively.
Gene expression profiles and in silico pathway analysis of transgenic mouse melanomas, secreting different amounts of histamine show a histamine H1 receptor dependent suppression of expression of the tumor suppressor insulin-like growth factor II receptor and the antiangiogenic matrix protein fibulin-5.
2. HDC-knockout mice show a high rate of spontaneous and induced colon and skin carcinogenesis. HDC is expressed primarily in immature myeloid cells (IMCs) that are recruited early on in chemical carcinogenesis. Transplant of HDC-deficient bone marrow to wild-type recipients results in increased IMC mobilization and reproduces the cancer susceptibility phenotype of HDC-knockout mice. In addition, mouse CT26 colon cancer cells directly downregulate HDC expression epigenetically through promoter hypermethylation and inhibit myeloid cell maturation.
Conclusion
The findings indicate bilateral interactions between histamine and tumor development as well as anti-tumor response.


B18:
Predicting the Next Pandemic: Influenza A Mutations and Developing Countries

Subject: Disease Models & Epidemiology

Presenting Author: Mary Halpin, Kent State University

Author(s):
Helen Piontkivska, Kent State University

Abstract:
The influenza A virus continues to be a serious threat to human health, with its multiple strains in circulation worldwide composed of different combinations of the hemagglutinin (HA) and neuraminidase (NA) protein subtypes (e.g. H3N2). Indeed, ongoing reassortment and occasional host change may lead to rapid development of pandemic strains characterized by high morbidity and mortality rates, as recently illustrated by the 2009 H1N1 pandemic. Though no pandemic has yet reached the levels of the 1918 Spanish flu, which killed over 40 million people out of the world’s then 1.8 billion, the danger is still very real due not only to the rapid mutation rate, but also to the “shrinking” of our world. One country’s population is no longer isolated from another’s, so a disease that begins on a small farm in Cambodia can spread to every major city worldwide. Therefore, studying how differences in societies and economic infrastructure (such as access to healthcare and clean water) affect the mutation rates in developing influenza strains is important to pandemic prevention. In this project we examine the rates of viral evolution across different countries to determine whether the virus evolves faster in countries with lower socioeconomic rankings due to lack of healthcare access (leading to higher rates of transmission). By contrasting the changes in influenza strains, both seasonal and pandemic, between countries with very different socioeconomic statuses, the results will show how important aiding the development of these countries is to the health of the world.


B19:
A systems biology approach to understanding the efficiency of the lymph node in producing primed T cells

Subject: Disease Models & Epidemiology

Presenting Author: Chang Gong, University of Michigan

Author(s):
Josh Mattila, University of Pittsburgh
Paul Wolberg, University of Michigan
Mark Miller, Washington University School of Medicine
JoAnne Flynn, University of Pittsburgh School of Medicine
Jennifer Linderman, University of Michigan
Denise Kirschner, University of Michigan

Abstract:
Dendritic cells (DCs) ingest foreign material (antigen) present during infection, process antigen for display on their cell surface, and migrate to T cell zones of lymph nodes (LNs), where they search for antigen-specific cognate T cells and initiate a cascade of events leading to priming. These primed T cells then return to the site of infection to lead the body’s defense. Recent studies employing two-photon microscopy (2PM) have significantly advanced our knowledge of T cell motility and the behavior of cognate T cells in the presence of antigen-bearing DCs within LNs, but many unanswered questions remain. For example, it is difficult to relate the short length- and time-scale measurements of 2PM to the efficiency of LNs in producing primed T cells. We developed a three-dimensional (3D) agent-based model representing the T cell zone of LNs, allowing for rapid in silico simulation of T cell zone function. We calibrated the model according to primate LN section imaging and 2PM data. We used the model to explore the effect of T cell zone morphology on LN efficiency and used uncertainty and sensitivity analysis to predict which mechanisms contribute significantly to the production of primed T cells. Our systems biology approach provides a platform not only to understand but also to guide manipulation of LN function in the context of disease.


B20:
Molecular evolution of CTL epitopes in worldwide HIV-1 genome

Subject: Disease Models & Epidemiology

Presenting Author: Reeba Paul, Kent State University

Author(s):
Joel Serre, Kent State University
Michael Rose, Kent State University
Patrice Conway, Kent State University
Sinu Paul, Kent State University
Helen Piontkivska, Kent State University

Abstract:
During viral infection, the interactions between the host immune system, such as class I major histocompatibility complex molecules, and viral cytotoxic T lymphocyte (CTL) epitopes, play a major role. The persistent positive selection pressure from the immune system often results in amino acid changes in CTL epitopes leading to “escape”.
Interestingly, some epitopes appear to harbor very low levels of amino acid substitutions despite ongoing interactions with the immune system. We have recently described a set of so-called “associated epitopes” (Paul and Piontkivska 2009, 2010) that consists of CTL epitopes that frequently co-occur among different subtypes of HIV-1 and exhibit signs of strong purifying selection. However, it is unclear whether the selective pressure acts uniformly across different associated epitopes. In this study we examined patterns of genomic changes in various CTL epitope regions from HIV-1 genomes sampled worldwide to better understand the forces driving sequence changes at these epitopes.


B21:
Synonymous Codon Usage in Highly Expressed Genes amongst Bacteria

Subject: Sequence Analysis

Presenting Author: Patrick Schreiner, Loyola University of Chicago

Author(s):
Catherine Putonti, Loyola University of Chicago
Bryan Quach, Loyola University of Chicago
Adam Hilterbrand, University of Texas
Joseph Saelens, Duke University

Abstract:
Biases in the usage of synonymous codons have been observed across all branches of the tree of life. Mutational biases, drift, and translational selection have all found support in shaping codon usage. Within bacterial genomes, these biases range from the relatively neutral to quite strong. Moreover, biases are often most prominent within those genes which are most highly expressed, a reflection of tRNA abundances within the cell. Using a number of different metrics to quantify codon usage biases, we have computed the codon usage within the highly expressed genes (HEGs) for over 1700 bacterial strains. From these computations, we are able to compare codon usage between strains, between species, as well as between more distantly related bacteria.
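
The abstract does not name the metrics it uses. As one illustration only, the sketch below computes relative synonymous codon usage (RSCU), a widely used codon bias measure: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The partial codon table, the function name `rscu`, and the toy sequence are assumptions made for this example.

```python
from collections import Counter

# Partial synonymous-codon table (standard genetic code), for illustration only.
SYNONYMOUS = {
    "Lys": ["AAA", "AAG"],
    "Glu": ["GAA", "GAG"],
    "Phe": ["TTT", "TTC"],
}

def rscu(cds):
    """Return the RSCU value of each codon for the amino acids in SYNONYMOUS."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    result = {}
    for family in SYNONYMOUS.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        expected = total / len(family)  # count under equal usage
        for codon in family:
            result[codon] = counts[codon] / expected
    return result

# Hypothetical toy coding sequence: Lys Lys Lys Glu Glu
example = "AAA" + "AAA" + "AAG" + "GAA" + "GAG"
values = rscu(example)
```

An RSCU above 1 marks a codon used more often than expected under equal usage; values near 1 across a family indicate little bias.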


B22:
MBNI Cloud: High Performance Computing Optimized for Omics Data Analysis in Biomedical Research

Subject: other

Presenting Author: Manhong Dai, University of Michigan

Author(s):
Fan Meng, University of Michigan

Abstract:
The analysis of deep sequencing and other high throughput genomic, transcriptomic, proteomic and metabolomic data from biomedical research demands powerful data processing capabilities. Omics data analysis has several unique requirements that are often not supported by typical high performance computing clusters, such as large physical memory, huge long term hard drive storage, both classic and MapReduce computing, database support for functional annotation and privacy protection. The MBNI Cloud is a grassroots effort at creating a highly scalable solution addressing omics data analysis needs. It now supports both classic cluster computing and Hadoop computing tasks. The MBNI Cloud is supported by researchers from several departments through a very cost-effective co-op model (http://cluster.mbni.med.umich.edu/).


B23:
iPathCase

Subject: other

Presenting Author: A. Ercument Cicek, Case Western Reserve University

Author(s):
Stephen R. Johnson, Case Western Reserve University
Xinjian Qi, Case Western Reserve University
Gultekin Ozsoyoglu, Case Western Reserve University

Abstract:
The PathCase family of applications has been developed over the last decade to integrate various biological data sources, and to provide extensive functionalities such as browsing, querying and visualizing biological data. PathCase systems are online, and frequently used by systems biology researchers all over the world. Recognizing the rise of tablet devices and their significance in education, we have developed iPad applications for two of the subsystems in the PathCase family, namely, PathCaseMAW (Metabolomics Analysis Workbench) and PathCaseKEGG (featuring KEGG Pathways). These iPad applications, named iPathCase, also allow users to browse, visualize and analyze biological pathways through the multi-touch interface of the iPad. They are released and available for download in the Apple App Store, free of charge. Our goal with iPathCase applications is to provide a mobile interface for researchers to access the biological data and to expand the audience of PathCase from researchers to students in biology-related fields.


B24:
Providing Bioinformatics Instruction and Support to the Biomedical Informatics Community

Subject: other

Presenting Author: Jean Song, University of Michigan

Author(s):
Marci Brandenburg, University of Michigan

Abstract:
With the number of tools and technologies being used to store, retrieve, and analyze proteomic, genomic, and metabolomic data, there is an increasing need for additional support to the biomedical community, which can be provided by librarians, specifically Bioinformationists. This poster covers several ways in which librarians can support biomedical research and the bioinformatics community. They can provide instruction in a variety of ways, including hands-on sessions, webinars, and user manuals. The University of Michigan Bioinformationist has learned to use several bioinformatics resources, such as Cytoscape, ConceptGen, and LRpath. Although the Bioinformationist started by offering hands-on Cytoscape training sessions, webinars are also offered to reach an expanded audience. The Bioinformationist updates user manuals for new versions of tools, making it easier for users to teach themselves. In the spring of 2011, the Bioinformationist jointly conducted a usability study on three network visualization software tools: Cytoscape, VisANT, and ConceptGen. Through these methods and others, the Bioinformationist has played an important role in helping researchers learn how to use bioinformatics tools for their work and is continuously brainstorming on new tools to teach, new methods for teaching these tools, and new means for providing instruction to further support the biomedical community.


B25:
Efficient structure prediction for non-coding RNAs including pseudoknots

Subject: Sequence Analysis

Presenting Author: Rujira Achawanantakun, Michigan State University

Author(s):
Yanni Sun, Michigan State University

Abstract:
The functions of many non-coding RNAs are determined by both their sequences and secondary structures. The pseudoknot is an important structural element found in many types of non-coding RNA. The state-of-the-art structure annotation tools derive the consensus structure of homologous non-coding RNAs and have better accuracy than ab initio folding tools. These consensus structure prediction tools first align homologous sequences, and then predict the consensus structure from the alignment. The quality of a predicted structure heavily relies on the quality of the alignment. However, some types of non-coding RNAs lack strong sequence similarity and cannot be aligned using sequence alignment tools. Thus, there is a need for more efficient and accurate non-coding RNA secondary structure prediction methods.

In this work, we proposed a novel method for non-coding RNA secondary structure prediction including pseudoknots. The method is based on grammar strings, which encode both sequence and secondary structure in the parameter space of a context-free grammar. There are two major steps in our method: shape analysis and consensus structure derivation. In the shape analysis step, we start with extracting the shape context features from the non-coding RNA structures and adopt the feature selection method for shape ranking. The selected shape is then used as a guide for generating the consensus structure that corresponds to the shape. Since the consensus structure derivation step takes not only sequence as input, but also considers structure, this leads to a better quality of predicted structures over other tools.


B26:
Comparative Analysis of Short Read Mapping Algorithms for Transcriptome References of Varying Completeness

Subject: Sequence Analysis

Presenting Author: Alexis Black Pyrkosz, United States Department of Agriculture

Author(s):
Hans Cheng, United States Department of Agriculture
C. Titus Brown, Michigan State University

Abstract:
Next-generation sequencing (NGS) techniques as applied to transcriptomes reveal key information about gene expression. However, the first step in the sequence analysis computational pipeline involves aligning short reads to a reference transcriptome. While complete transcriptomes are available for model organisms, such as human and mouse, the availability and completeness of transcriptomes for other organisms is sometimes limited. Our goal is to determine the effect of transcriptome completeness on accurate read mapping.

We performed a comparative study of commonly used short read mapping algorithms' accuracy when used with reference transcriptomes of varying completeness. Using Ensembl cDNA for a complete transcriptome from a model organism and incomplete transcriptomes from nonmodel organisms, we simulated reads with 1% substitution error and mapped them to the reference transcriptome at intervals of reference completeness ranging from 50% to 100%. Our results indicate that the largest factor in the mapping accuracy is the presence of reads that match multiple transcripts due to isoforms. The methods available for assigning these reads (random, unique, or multimap) are each explored and indicate that none of them solves the isoform problem. As the percent completeness of the reference transcriptome decreases, thereby having reads in the data set that do not have a matching transcript in the reference, the false positive rate increases. Our results indicate that incomplete transcriptomes distort the read mapping, highlighting the necessity for complete references early in the current computational pipeline.
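
The simulation setup described above can be sketched roughly as follows: reads are drawn from transcripts with a per-base substitution error rate of 1%, and the reference is subsampled to a chosen completeness. The abstract does not say which simulator or sampling scheme was used, so the function names, read length, and random toy transcripts here are assumptions for illustration.

```python
import random

def simulate_read(transcript, read_len, error_rate, rng):
    """Draw one read from a transcript and apply per-base substitution errors."""
    start = rng.randrange(len(transcript) - read_len + 1)
    read = []
    for base in transcript[start:start + read_len]:
        if rng.random() < error_rate:
            # Replace the base with one of the three other nucleotides.
            base = rng.choice([b for b in "ACGT" if b != base])
        read.append(base)
    return "".join(read)

def subsample_reference(transcripts, completeness, rng):
    """Keep a random fraction of transcripts to mimic an incomplete reference."""
    keep = round(len(transcripts) * completeness)
    names = rng.sample(sorted(transcripts), keep)
    return {name: transcripts[name] for name in names}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical toy transcriptome: 10 random transcripts of 200 nt each.
transcripts = {
    f"tx{i}": "".join(rng.choice("ACGT") for _ in range(200)) for i in range(10)
}
reads = [simulate_read(transcripts["tx0"], 50, 0.01, rng) for _ in range(100)]
partial = subsample_reference(transcripts, 0.5, rng)  # 50% complete reference
```

Reads simulated from transcripts that are missing from `partial` have no true target in the reference, which is the situation the abstract associates with a rising false positive rate.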


B27:
Quality Control in RNA-Seq

Subject: Sequence Analysis

Presenting Author: Phillip Dexheimer, Cincinnati Children's Hospital Medical Center

Author(s):
Mehdi Keddache, Cincinnati Children's Hospital Medical Center
Bruce Aronow, Cincinnati Children's Hospital Medical Center

Abstract:
RNA sequencing (RNA-Seq) has rapidly replaced microarrays as the method of choice for assessing RNA expression on a genome-wide scale. Identifying and normalizing or removing poorly performing samples has long been a requirement for microarray analysis, because poor performers increase noise in the data and obscure biologically significant patterns. Despite the previously established need for quality control, calibration, and technology-dependent corrections in expression analysis, we lack an organized framework for evaluating categories of poor or differentially affected technical characteristics of RNA-Seq samples. We present here a comprehensive quality control report for RNA-Seq experiments that captures such diverse metrics as ribosomal RNA quantity, fraction of reads that align to exons, and 3'/5' bias. We show that collecting these metrics in a consistent and principled manner and presenting them in a coherent fashion aids significantly in the identification of poor performers and errors in primary analysis.


B28:
A comprehensive polyadenylation site map in Human genome based on both Sanger and Direct RNA Sequencing data

Subject: Sequence Analysis

Presenting Author: Cheng Guo, Miami University

Author(s):
Min Dong, Miami University
Q. Quinn Li, Miami University
Chun Liang, Miami University

Abstract:
We report a comprehensive map of global polyadenylation events in the human genome using cDNA/mRNA data obtained from Sanger sequencing and Direct RNA Sequencing (DRS). Our bioinformatics pipeline for data processing and analysis has been improved in the following aspects. Poly(A) tails in cDNA/mRNA can be more accurately identified using our in-house program that utilizes a hidden Markov model. To accurately determine the poly(A) site along genome sequences, we systematically compared popular cDNA-to-genome mapping tools - GMAP, GSNAP and Helisphere (Helicos). Helisphere seems to be able to generate more accurate alignment results for poly(A) site determination, with higher sensitivity and specificity. A unique algorithm has been developed to cluster adjacent poly(A) sites to characterize and differentiate micro-heterogeneity/macro-heterogeneity of poly(A) sites. More than 60,000 high-quality poly(A) sites are identified in the human genome, which will facilitate our understanding of molecular mechanisms involved in polyadenylation. In particular, our bioinformatics pipeline/protocol has been designed to be generic so that it can be easily utilized for other species as well as for new data like Illumina and 454 sequence reads.
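
The abstract does not describe the clustering algorithm itself. Purely as an illustration of the general idea of grouping adjacent poly(A) sites, the sketch below uses simple single-linkage clustering on one strand of one chromosome; the 24 nt `max_gap` and the site coordinates are hypothetical and are not taken from the poster.

```python
def cluster_sites(positions, max_gap=24):
    """Group positions into clusters whose neighboring sites lie within max_gap nt."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)  # close enough: extend the current cluster
        else:
            clusters.append([pos])    # too far: start a new cluster
    return clusters

# Hypothetical poly(A) site coordinates on one strand of one chromosome.
sites = [100, 105, 112, 400, 410, 900]
clusters = cluster_sites(sites)
```

Under this scheme, spread within a cluster would correspond to micro-heterogeneity, while separate clusters in the same gene would correspond to macro-heterogeneity.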


B29:
Mind the Gaps - GapMine 1.0

Subject: Sequence Analysis

Presenting Author: Jeremy Harris, Medical College of Wisconsin

Author(s):
Brandon Wilk, Medical College of Wisconsin
Sharon Tsaih, Medical College of Wisconsin
George Kowalski, Medical College of Wisconsin
Jeff DePons, Medical College of Wisconsin
Elizabeth Worthey, Medical College of Wisconsin

Abstract:
Next-generation sequencing technology has come a long way from its infancy, and today provides a relatively cost-effective way of sequencing individual genomes to high depths of coverage. Comparison of the sequence data from these individual genomes against the reference genome can then be used to identify variant loci that may be associated with differences in phenotype. In some cases, when searching for variants associated with a phenotype, a researcher will carry out initial analysis using a set of candidate genes of known and potentially associated functions. Under these circumstances it is important to be able to determine whether these regions have been covered to sufficient depth for accurate variant calling. It is anticipated that between 5 and 8% of the reference will not be covered in a particular Illumina sequence. This is in addition to the 5% of sequence data already believed to be missing from the reference sequence itself. Our application has been developed to efficiently and effectively identify gaps in Illumina-generated sequence coverage using the mapped sequence read data. The sequence data is analyzed nucleotide by nucleotide; a user-specified depth of coverage threshold is applied for each run. Regions that fall below the minimum depth of coverage threshold are written to an output file. Finally, the tool cross-references these underrepresented regions against a database of genomic features to produce a summary of the genomic features of interest not sufficiently covered. This information is both summarized and output in an easily viewed format.
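
The two steps described above, a per-nucleotide scan against a depth threshold followed by a cross-reference against genomic features, can be sketched as below. This is not GapMine's code: the input here is assumed to be a plain list of per-base depths rather than mapped read data, and the function names, feature names, and coordinates are hypothetical.

```python
def find_gaps(depths, min_depth):
    """Return 0-based half-open (start, end) intervals where depth < min_depth."""
    gaps = []
    start = None
    for i, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = i              # a low-coverage run begins here
        elif start is not None:
            gaps.append((start, i))    # the run ended at the previous base
            start = None
    if start is not None:
        gaps.append((start, len(depths)))  # run extends to the end
    return gaps

def features_with_gaps(gaps, features):
    """Return names of features (name -> (start, end)) overlapping any gap."""
    return [
        name
        for name, (f_start, f_end) in features.items()
        if any(g_start < f_end and f_start < g_end for g_start, g_end in gaps)
    ]

# Hypothetical per-base depths and feature coordinates.
depths = [30, 28, 4, 2, 0, 25, 31, 9, 8, 40]
gaps = find_gaps(depths, min_depth=10)
features = {"geneA": (0, 2), "geneB": (3, 6), "geneC": (9, 10)}
under_covered = features_with_gaps(gaps, features)
```

Half-open intervals make the overlap test a single comparison pair and avoid off-by-one errors at feature boundaries.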


B30:
A Multispecies Polyadenylation Site Model

Subject: Sequence Analysis

Presenting Author: Eric Ho, Rutgers University-New Brunswick

Author(s):
Samuel Gunderson, Rutgers University-New Brunswick
Siobain Duffy, Rutgers University-New Brunswick

Abstract:
Polyadenylation occurs in all three domains of life, making it the most conserved post-transcriptional process compared with splicing and 5’-capping. We are interested in the evolution of polyadenylation sites (PAS) in diverse species and DNA viruses. Even though most mammalian PAS contain a highly conserved hexanucleotide in the upstream region, namely the canonical poly(A) signal, and a far less conserved U/GU-rich sequence in the downstream region, there are many exceptions. Furthermore, PAS in other species, such as plants and invertebrates, exhibit high deviation from this genomic structure, making the construction of a general PAS recognition model challenging. We surveyed nine PAS prediction methods published between 1999 and 2011. All of them exploit the skewed nucleotide profile across the PAS, and the highly conserved poly(A) signal as the primary features for recognition. The number of features utilized by these methods is usually large (from 15 to 274), which contributes to the curse of dimensionality. Here we propose a PAS model that employs minimal features to capture the essence of PAS, and yet produces better prediction accuracy across diverse species. Our model utilizes three di- or trinucleotide profiles, and the predicted nucleosome occupancy in the region 300 nucleotides upstream and downstream of the PAS. We validated our model using two machine learning methods, viz. logistic regression and linear discriminant analysis. Results showed that the model achieves 85-92% sensitivity and 85-93% specificity in human, chicken, C. elegans, and Arabidopsis thaliana. Applying our PAS model across species can shed light on the evolution of PAS.


B31:
CBrowse: a BAM-based contig browser for transcriptome assembly visualization and analysis

Subject: Sequence Analysis

Presenting Author: Pei Li, Miami University

Author(s):
Guoli Ji, Xiamen University
Chun Liang, Miami University
Emily Schmidt, Miami University

Abstract:
To address the impending need for exploring rapidly increased transcriptomics data generated for non-model organisms, we developed CBrowse, an AJAX-based, cross-platform browser for visualizing and analyzing transcriptome assemblies and contigs. Designed in standard 3-tier architecture with a data pre-processing pipeline, CBrowse is essentially a Rich Internet Application that offers many seamlessly integrated web interfaces and allows users to navigate, sort, filter, search and visualize data easily and smoothly. CBrowse takes the contig sequence file in FASTA format and its relevant SAM/BAM file as the input, detects putative polymorphisms, simple sequence repeats and sequencing errors in contigs, and generates image, JSON and MySQL-compatible CSV text files that are directly utilized by different web interfaces. CBrowse is a generic visualization and analysis tool that facilitates close examination of assembly quality, genetic polymorphisms, sequence repeats and/or sequencing errors in transcriptome sequencing projects.


B32:
Interactive RNA Secondary Structure Graphical Interface with Free Energy Calculation

Subject: Sequence Analysis

Presenting Author: Andrew Marmaduke, University of Akron

Author(s):
Jeff Chapman, University of Akron

Abstract:
RNA is a basic builder and transcriber for important proteins that are located in all living organisms. Similar to protein, RNA will spontaneously fold into structures, known as their native state or the lowest energy state. The folded structure is vital to the function of the RNA. A graphical interface was developed in Java to visualize the completed two-dimensional fold, or secondary structure, as well as associated energy. The GUI displays the elements of a secondary structure: backbone, nucleotides, and base pairs. The free-form layout, most commonly used by scientists, and the Feynman Diagram were both implemented to display the RNA. Two algorithms were implemented to calculate free energy: the Vienna algorithm and the Zuker algorithm, although other algorithms were also considered. Visual presentation of possible base pair options and associated effects were also examined. This GUI is used primarily for the visualization of the secondary structure as well as free energy, allowing for complete manipulation of nucleotide positions and base pairing as well as reviewing the free energy associated with the secondary structure under consideration.


B33:
Phylogenetic Analysis of the Aldehyde Dehydrogenase Superfamily

Subject: Sequence Analysis

Presenting Author: John Perozich, Franciscan University of Steubenville

Author(s):
Troy Wymore, Pittsburgh Supercomputing Center
Hugh Nicholas, Pittsburgh Supercomputing Center
John Hempel, University of Pittsburgh

Abstract:
Aldehyde dehydrogenases (ALDHs) are a superfamily of ubiquitous enzymes that catalyze the oxidation of aldehydes to their corresponding carboxylic acids. Most organisms have several distinct ALDHs that take part in a variety of physiological roles. Previous ALDH alignments supported a conserved ALDH structure, suggested residues with important structural and functional roles and identified thirteen distinct ALDH families. In this research, phylogenetic analysis was performed on an updated alignment of 1310 ALDH protein sequences. Eighteen previously and newly recognized ALDH families were clearly distinct in the phylogenetic tree. Subgroupings for many families based either on substrate, organism or subcellular location were also apparent. ALDHs now represent a true superfamily of enzymes, as the superfamily now contains the Δ1-Pyrroline-5-Carboxylate Synthetase and ADHE families, both of which have activities that are the reverse of classic ALDHs (conversion of acids to aldehydes). In addition, several more potential ALDH families, such as lactaldehyde, glyceraldehyde and 6-oxohexanoate dehydrogenases, may exist. However, most sequences in those groups lack the reported functional activity needed to confirm their identities. Also, several groups of bacterial ALDHs with unidentified activity exist. As many sequences submitted to databases are named based on homology, and not confirmed activity, the actual identity of these ALDHs and the reason for their clustering in the tree remain unclear.


B34:
Comparative Analysis of the Sialidase Family

Subject: Sequence Analysis

Presenting Author: Anne McMahon, Franciscan University of Steubenville

Author(s):
John Perozich, Franciscan University of Steubenville

Abstract:
Sialidase removes terminal sialic acid residues from glycolipids and glycoproteins. Sialidases are found not only in prokaryotes and eukaryotes, but also in viral envelopes. To date, there have been few studies that include all three major groups of sialidases. The purpose of this undergraduate research is to compare viral, prokaryotic, and eukaryotic sialidases, looking for similarities in structure and function, as well as to better understand the phylogenetic relationships among these groups. Alignment of 103 protein sequences was guided using a structural alignment due to the disparity between viral and non-viral sialidase sequences. Despite highly conserved tertiary structures, only three residues were invariant: Gly-203, Arg-237, and Arg-304. Additionally, eight residues were conserved above 90%. Most of these residues serve critical roles in substrate coordination and catalysis, while others maintain overall structure. Arg-21, Arg-237, and Arg-304 form a triad that coordinates the substrate’s carboxylate group, along with Tyr-334. Glu-218 further stabilizes the substrate. Asp-46 interacts with C-2 of sialic acid, cleaving the glycosidic bond. Glu-355 forms a salt bridge with Arg-21, and Arg-243 also adds stability to the active site. Pro-23 positions Arg-21, as well as Ile-22, conserved only in non-viral sialidases. Pro-267, Asp-92, and Gly-203 help to position their respective hairpin loops. Pattern analysis demonstrates the disparity of motif conservation between viral and non-viral sialidases. Phylogenetic analysis shows defined groupings for fungal, mammalian, and Influenzas A and B, while bacterial sialidases were less organized.
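
Counts such as "three invariant residues" and "eight residues conserved above 90%" come from per-column conservation of the alignment. The sketch below shows one simple way to compute it, as the fraction of sequences carrying the most common residue in each column. The abstract does not state the exact definition or software used, so treating gaps as non-matching, the function name, and the toy alignment are assumptions.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column.

    Gap characters ('-') are not counted as residues, so they lower the score.
    """
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n)
    return scores

# Hypothetical toy alignment: 4 sequences, 4 columns.
alignment = ["GRAK", "GRSK", "GKAK", "GR-K"]
scores = column_conservation(alignment)
invariant = [i for i, s in enumerate(scores) if s == 1.0]
above_70 = [i for i, s in enumerate(scores) if s > 0.7]
```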


B35:
Comparative Analysis of Pyruvate Kinases

Subject: Sequence Analysis

Presenting Author: Alyssa Morey, Franciscan University of Steubenville

Author(s):
John Perozich, Franciscan University of Steubenville

Abstract:
Pyruvate Kinase (PK) is an enzyme crucial for completion of glycolysis in all living organisms. Defective forms of PK cause disorders such as mammalian hemolytic anemia. This undergraduate research project compared human erythrocyte PK with PK enzymes from various species and investigated structural, functional, and phylogenetic relationships. PK sequences from 60 organisms were aligned. Only three of the 543 residues (Gly-338, Asp-268, Glu-387) were fully conserved, all of which have roles in structural integrity of the protein for proper intramolecular and substrate interactions. 249 residues had at least 60% conservation in aligned sequences, indicating high structural conservation. Overall, the Pyruvate Kinase protein sequence is highly conserved through all living organisms due to its pivotal metabolic role, suggesting that mutations are not sufficiently stable to allow proper interactions for functionality. The most highly conserved residues are found in the active site, indicating the critical importance of maintaining functionality in this protein, and also in subunit interfaces. Significantly fewer residues are conserved in the allosteric site. Pattern analysis indicated that the two most well conserved motifs contain functional residues in the active site. Phylogenetic analysis revealed distribution of the sequences based upon taxonomic relationships.


B36:
Comparative Analysis of Insulysins

Subject: Sequence Analysis

Presenting Author: Katie Kirrane, Franciscan University of Steubenville

Author(s):
John Perozich, Franciscan University of Steubenville

Abstract:
Insulysin or Insulin Degrading Enzyme (IDE) is a zinc metallopeptidase in the M16 family. There have been links shown between IDE and both type II diabetes and Alzheimer’s disease. The goal of this undergraduate research project was to compare the amino acid sequences from 41 organisms, looking for structural, functional, and phylogenetic relationships. There were 42 invariant residues. The functions for many of the fully conserved residues were identified. In human IDE the metal center is chelated by His-108, His-112, and Glu-189. Glu-182 plays a fundamental role in influencing metal recognition and binding by zinc proteins. Glu-111 deprotonates the water molecule that completes the coordination sphere of zinc. Arg-824, Tyr-831 and Glu-189 form hydrogen bonds with two carbonyl oxygens and one amide hydrogen of the substrate. Two hydrogen bonds are formed by Glu-182 and His-112 and from Thr-220 and Gly-221 to His-118. These have a stabilizing effect on the whole structure of the enzyme. In rat IDE Gly-339, Leu-359, and Gly-341 interact with the N-terminal amino acid group. Tyr-609 and His-332 and the main chain carbonyl of Gly-361 may make hydrogen bonds with the peptide backbone. Phylogenetic analysis shows clear distinction between animal and fungal IDEs.


B37:
Comparative Analysis of Aromatic Amino Acid Hydroxylases

Subject: Sequence Analysis

Presenting Author: Christopher Maguire, Franciscan University of Steubenville

Author(s):
John Perozich, Franciscan University of Steubenville

Abstract:
Aromatic amino acid hydroxylases are a family of closely related proteins involved in metabolizing amino acids that have an aromatic ring. The goal of this undergraduate research project was to compare the amino acid sequences of tyrosine (TyrOH), phenylalanine (PheOH), and tryptophan hydroxylases (TrpOH) from a variety of organisms and look for structural, functional and phylogenetic relationships. A total of 108 sequences (55 PheOH, 27 TrpOH, 26 TrpOH) were compared. Eleven fully conserved residues were identified (S-349, P-281, G-289, G-344, H-285, H-290, E-330, P-225, R-270, D-282, E-353) and 94 were at least 60% conserved (amino acid position numbers were taken from Human PheOH). Functions were identified for many of these conserved residues. H-290, H-285, and E-330 were found to participate in iron binding. E-353 hydrogen bonds to the main chain nitrogen of E-383, which is 99% conserved. S-349 is positioned very close to the substrate, and also hydrogen bonds to H-285 to help position α-helix-7. P-281 helps to define the shape of the active site and is in close proximity to the bound iron atom. R-270 and D-282 form a salt bridge with each other and they may also interact with the closely bound TEA molecule. F-254, which is 99% conserved, is involved in a ring stacking interaction with the pterin ring of the bound THB molecule. Phylogenetic analysis indicated distinct functional groupings for the three enzymes.


B38:
Efficient Detection and Correction of Sequencing Errors Using K-Bounded Suffix Trees

Subject: Sequence Analysis

Presenting Author: Daniel Savel, Case Western Reserve University

Author(s):
Mehmet Koyutürk, Case Western Reserve University
Thomas LaFramboise, Case Western Reserve University
Wojciech Szpankowski, Purdue University
Ananth Grama, Purdue University

Abstract:
Next generation sequencing technologies produce large quantities of short reads with an error rate that adversely affects effective use of these reads. One of the primary uses of sequencing data is de novo genome assembly, which is complicated and can be obfuscated by sequencing errors. Unfortunately, the first time the assembly process is done, no reliable reference sequence is available to compare against; therefore error detection and correction can only use the set of available reads as a reference. State-of-the-art methods for error detection and correction utilize the frequencies of the substrings of the reads, based on the principle that low-frequency substrings may point to sequencing errors. One of the main limiting factors of the error correction methods is the amount of memory required to perform the detection and correction procedures as the set of reads from the sequencer is typically very large and thus the set of substrings of all those reads is also very large. Existing methods typically store comprehensive sets of the correction unit using an efficient data structure, either k-mers (using hash tables) or suffixes (using suffix trees). However, in these data structures, each error manifests itself multiple times, causing redundancy. Here, we propose a method for leveraging the level of memory reduction against accuracy and relating it to remaining error manifestations. Our experimental results show that better performance and accuracy in error correction can be achieved by reducing the amount of data stored in the data structure.
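The frequency principle described in this abstract can be illustrated with a minimal sketch. It uses a plain k-mer hash table (one of the existing approaches the abstract mentions), not the authors' k-bounded suffix tree, and the function names, the toy reads, and the `min_count` threshold are assumptions chosen for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_positions(read, counts, k, min_count):
    """Return start positions of k-mers in `read` seen fewer than min_count times."""
    return [
        i
        for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < min_count
    ]


# Toy data (hypothetical): three identical reads and one with a single
# substitution at index 4 (A -> T).
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
counts = kmer_counts(reads, k=4)
flagged = suspicious_positions(reads[3], counts, k=4, min_count=2)
print(flagged)
```

In this toy case the single substitution makes four overlapping 4-mers rare, which is the redundancy the abstract refers to when it says each error manifests itself multiple times.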


B39:
The Divergence of Shift Scores and Structural Alignment Scores

Subject: Sequence Analysis

Presenting Author: Michael Sierk, Saint Vincent College

Abstract:
A critical component of producing a homology model of a protein is the alignment of the protein sequence to be modeled and the sequence of the template molecule whose structure is known. As part of the process of testing and evaluating different alignment methods, one needs to compare the quality of various sequence-based alignments of the same two sequences against the correct alignment based on the 3D structures of the two proteins. As a way to efficiently compare many such alignments, Cline et al. (Bioinformatics 18(2). 2002.) developed the Shift Score, which is a metric that accounts for both very accurate alignments over shorter distances and alignments that are mostly correct over longer distances but may be out of register by a handful of amino acids. In the process of assessing sets of suboptimal sequence alignments it was discovered that many of these alignments had similar Shift Scores against a reference structural alignment, while having very different structural similarity scores when the sequence alignment was used to produce a structural alignment. This discrepancy holds for a variety of different structural similarity measures. Here I analyze this discrepancy and explore its ramifications for assessment of alignment accuracy. I also propose a modified version of the Shift Score that reduces the divergence between the Shift Score and structural similarity score.


B40:
Frnakenstein: multiple target inverse RNA folding

Subject: Sequence Analysis

Presenting Author: Elena Sizikova, University of Oxford

Author(s):
Rune Lyngsoe, University of Oxford
James Anderson, University of Oxford
Tomas Hyland, University of Oxford
Amarendra Badugu, ETH Zurich

Abstract:
Motivation: RNA secondary structure prediction, or folding, is a classic problem in bioinformatics: given a sequence of nucleotides, the aim is to predict the base pairs formed in its three dimensional conformation. The inverse problem of designing a sequence folding into a particular target structure has only more recently received significant interest. With a growing appreciation and understanding of the functional and structural properties of RNA motifs, and a growing interest in utilising biomolecules in nano-scale designs, the interest in the inverse RNA folding problem is bound to increase. However, whereas the RNA folding problem has an elegant and efficient solution, the inverse RNA folding problem appears to be hard. We present a genetic algorithm approach to solve the inverse folding problem. The method performs well compared to other existing methods. It further addresses the hitherto mostly ignored extension of solving the inverse folding problem for multiple target structures, allowing designs of artificial ribo-switches.

Results: The genetic algorithm has been implemented as a Python program. It was benchmarked against four existing methods and several data sets totalling 755 real and predicted targets. It performed as well as or better than all existing methods, without the heavy bias towards CG base pairs that was observed for all other top performing methods. On 200 two-structure targets it also performed well, generating a perfect design for about two thirds of the targets.


B41:
CHURCHILL: A Comprehensive Analysis Pipeline for Discovery of Human Genetic Variation

Subject: Sequence Analysis

Presenting Author: Yangqiu (Patrick) Hu, Nationwide Children's Hospital

Author(s):
David Newsom, Nationwide Children's Hospital
Ben Kelly, Nationwide Children's Hospital
Travis Casper, Nationwide Children's Hospital
Huachun Zhong, Nationwide Children's Hospital
Peter White, The Research Institute at Nationwide Children's Hospital

Abstract:
Next generation sequencing (NGS) technologies have revolutionized genetic research and have empowered a dramatic increase in the discovery of new functional variants that are responsible for both Mendelian and common diseases. The output from NGS instrumentation is profoundly out-pacing Moore’s Law, and currently a single Illumina HiSeq 2000 is capable of producing 600 billion bases of sequencing output in under two weeks. Compounded by sharply falling sequencing costs, this exponential growth in NGS data generation has created a computational and bioinformatics bottleneck in which current approaches can take months to complete analysis and interpretation. This reality has created an environment where the cost of analysis exceeds the cost of physically sequencing the sample.

To overcome these challenges we have developed a computational pipeline (CHURCHILL) that fully automates the multiple steps required to go from raw sequencing reads to comprehensively annotated genetic variants. Through implementation of novel parallelization approaches we have dramatically reduced the analysis time from weeks to hours. Compared with the GATK-Queue script, our workflow implementation is simpler, faster, and more widely applicable to various shared memory/distributed High Performance Computing clusters. Furthermore, our modular pipeline has been designed with the flexibility to incorporate other analysis tools as they become available. Through comprehensive automation and parallelization of genome data analysis pipelines we present an elegant solution to the challenge of rapidly identifying human genetic variation in a clinical research setting.


B42:
Assessment of Alignment and Variant Calling Approaches for Analysis of Human Exome Capture Sequencing Data

Subject: Sequence Analysis

Presenting Author: Ben Kelly, The Research Institute at Nationwide Children's Hospital

Author(s):
Yangqiu (Patrick) Hu, The Research Institute at Nationwide Children's Hospital
Travis Casper, The Research Institute at Nationwide Children's Hospital
David Newsom, The Research Institute at Nationwide Children's Hospital
Huachun Zhong, The Research Institute at Nationwide Children's Hospital
Wesley Banks, The Research Institute at Nationwide Children's Hospital
Gail Herman, The Research Institute at Nationwide Children's Hospital
Peter White, The Research Institute at Nationwide Children's Hospital

Abstract:
Popular utilization of next-generation sequencing for the identification of human genetic variation has led to the development of numerous tools to process raw sequencing data and subsequently identify polymorphisms. Analysis of sequencing data can be broken into two computational steps: alignment of the sequencing reads to a reference genome and variant calling from that alignment. A variety of aligners and variant callers exist, but few complete pipelines exist. Given that each algorithm has respective strengths and weaknesses, we set out to determine the optimal combination of these tools.
In order to determine the ideal alignment and variant calling algorithmic pairing we utilized a novel, two-pronged approach. First, we generated a synthetic data set that closely resembles a typical exome experiment in coverage, polymorphisms (both SNPs and INDELs), base quality, and sequencing error rate. Sensitivity and specificity were measured for each alignment and variant calling combination. Second, we utilized multiple true experimental exome trios (child, mother, and father) to further evaluate the algorithm combinations using Mendelian error rates and transition/transversion ratios.
In both synthetic and real-world testing, SNP calling was consistent among most combinations tested. During INDEL calling, however, we identified alignment and variant calling combinations that significantly increased identification of true positives and reduced the calling of false positives. We recommend that no matter which alignment algorithm is used, local realignment and base quality recalibration should be performed before variant calling on as many variants as computationally feasible.
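The evaluation metrics named in this abstract can be sketched as follows. This is an illustrative sketch only, not the authors' evaluation code: the function names, the genotype representation (a pair of alleles per individual), and all numbers are assumptions.

```python
# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions;
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def sensitivity_specificity(tp, fp, tn, fn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


def ti_tv_ratio(snps):
    """Transition/transversion ratio for a list of (ref, alt) base pairs."""
    ti = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    tv = len(snps) - ti
    return float("inf") if tv == 0 else ti / tv


def is_mendelian_consistent(child, mother, father):
    """True if one child allele can come from each parent (diploid site)."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


# Hypothetical counts and calls, for illustration only.
sens, spec = sensitivity_specificity(tp=90, fp=50, tn=950, fn=10)
ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("G", "A"),
                     ("A", "C"), ("T", "C"), ("G", "T")])
print(sens, spec, ratio)
print(is_mendelian_consistent(("A", "G"), ("A", "A"), ("G", "G")))
```

A Mendelian error rate for a trio would then be the fraction of called sites at which `is_mendelian_consistent` returns False.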


B43:
A Hybrid Approach to De novo Assembly of Microbial Genomes using Short Read Sequencing Data

Subject: Sequence Analysis

Presenting Author: Travis Casper, The Research Institute at Nationwide Children's Hospital

Author(s):
Travis Casper, The Research Institute at Nationwide Children's Hospital
Patrick Hu, The Research Institute at Nationwide Children's Hospital
Robert Munson, The Research Institute at Nationwide Children's Hospital
Peter White, The Research Institute at Nationwide Children's Hospital

Abstract:
De novo genome assembly using next generation sequencing data remains challenging despite significant developments in assembly algorithms and software tools. Typical reads from current high throughput sequencing platforms range from 50 to 400 bases in length, whereas typical microbial genomes have 2 to 5 million bases. The challenge of de novo assembly from short read sequencing data is further complicated by the existence of repeated regions of the genome, and compounded by sequencing errors and inherent sequence biases unique to the different sequencing platforms.
We report our experience in developing a hybrid approach to solve this problem: we combine the results from multiple de novo assemblers with results from read mapping based algorithms using existing genomes of closely related strains. Simulated and real sequencing data were used to validate the results. Assembly performance was measured using number of contigs, N50, correctly oriented and joined contigs, percentage of reads mapped to the assembly, percentage of correctly paired mapped reads, and percentage of target genome covered by assembly. We show that this hybrid approach achieves significantly better results than the individual methods alone, and is able to produce high quality draft assemblies when long insert size libraries are available. Furthermore, we show that by simulating sequencing data it is possible to reliably predict the input library requirements in order to sequence and assemble a high quality microbial genome. Our approach is efficient and flexible, and has the potential to accommodate new de novo methods in the future.
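Of the assembly metrics listed above, N50 is the one whose definition is least obvious from its name. A minimal sketch of the standard definition follows (the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length); the contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths: total 290, half 145, reached at the 70 bp contig.
print(n50([80, 70, 50, 40, 30, 20]))
```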


B44:
ROOSEVELT: An Interactive Tool for Tertiary Analysis and Visualization of Human Genetic Variants

Subject: Sequence Analysis

Presenting Author: Donald Corsmeier, The Research Institute at Nationwide Children's Hospital

Author(s):
Ben Kelly, The Research Institute at Nationwide Children's Hospital
Yangqiu Hu, The Research Institute at Nationwide Children's Hospital
Travis Casper, The Research Institute at Nationwide Children's Hospital
Peter White, The Research Institute at Nationwide Children's Hospital

Abstract:
Biologists have set the expectation on the molecular bioinformatician to transform massive and complex raw human genomic data sets into accurately refined subsets from which they may easily derive meaningful insights. Sequencing the genomes from a single study can yield terabytes of read data which, in turn, is utilized to produce a list of variations consisting of millions of single nucleotide polymorphisms (SNPs), insertions and deletions (INDELs). The quality and presentation of these datasets are paramount to the successful interpretation of the results by the researcher or clinician and will be an underlying theme in the success of personalized medicine.

As part of our pipeline for genomic data analysis (CHURCHILL), we utilize standard bioinformatics approaches and databases of human genetic variation, as well as both heuristic and probabilistic methods to provide the basis for initial filtration of the variants. Predictions of alteration in protein function, gene pathway information, and various other annotation databases can then be used to prioritize damaging variants based on relevance to the study in question. Using standard Microsoft .NET framework technology, we have developed a tool to present the results of genomic data analysis in a familiar graphical user interface that allows for easy navigation and provides an intuitive means to explore the data as well as associations in the literature.

The tertiary analysis and reporting software presented strives to meet the significant challenge of successful interpretation and integration of next generation sequencing data into the pipeline of scientific discovery.


B45:
High Throughput Virus Sequence Analysis

Subject: Sequence Analysis

Presenting Author: Matt Wollerman, Case Western Reserve University

Author(s):
Frank Esper, University Hospitals

Abstract:
The primary goal of this project is to develop new algorithms for evaluating high throughput SNP variants. To accomplish this, the diversity of multiple virus populations such as respiratory syncytial virus (RSV) and human metapneumovirus (hMPV) within a patient’s genome will be evaluated. Existing tools will be used to analyze the large scale sequence data from the infected patients. Clinical samples confirmed to have either RSV or hMPV will undergo analysis to determine the variance across the entire viral genome. The sample will then be aligned to RSV using a Bowtie index and SNP calling will be done through analyzing mpileup data generated by SAMtools. Then, new algorithms will be developed to analyze these SNP variants.


B46:
Genome Motif Finder – A MatLab framework for genome-wide motif finding and analysis

Subject: Text Mining

Presenting Author: Sreeskandarajan Sutharzan, Miami University

Author(s):
Nicholas Uth, Miami University
Chun Liang, Miami University

Abstract:
General Feature Format (GFF) is widely used to store gene annotation information and other genomic features for genomic sequences. Data mining from GFF files would yield biologically significant information. Identifying sequence patterns or motifs in genomic sequences has many biological applications. For example, the presence of the sequence pattern AGAAAAAAAAA in APC codons is associated with colorectal cancer. Searching for such signals in the human genome would aid oncology studies. MatLab is increasingly being used in bioinformatics application development because of its rich tool boxes and rapid prototyping ability. We have developed a MatLab pipeline or framework that can mine genome sequences and extract desired sequence motifs using regular expressions and GFF files. The pipeline associates the genetic features presented in GFF files with the chromosome sequences in FASTA files, and permits customizable text mining using MatLab functions or modules. For example, a user can easily search for a sequence pattern (e.g., AAAAAA) in a desired region of the human genome (e.g., 3' UTR, intron or CDS), visualize the result and create outputs using this MatLab pipeline. In particular, the pipeline is designed to be generic so that it accepts chromosome sequences of any organism in FASTA format, a GFF file with genome annotation information, and the sequence pattern as a regular expression. Clearly, this MatLab pipeline will be a valuable addition to the tool box for biologists in genomics data analysis.
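The idea of restricting a regular-expression search to annotated regions can be sketched as below. This is written in Python rather than MatLab and is not the authors' pipeline: the function names and the toy FASTA and GFF records are assumptions, and the sketch ignores the strand column for brevity.

```python
import re


def parse_fasta(text):
    """Return {sequence_id: sequence} from FASTA-formatted text."""
    parts, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {seq_id: "".join(chunks) for seq_id, chunks in parts.items()}


def parse_gff(text, feature_type):
    """Yield (seqid, start, end) for GFF rows of the requested feature type."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) >= 5 and cols[2] == feature_type:
            yield cols[0], int(cols[3]), int(cols[4])


def find_motif(fasta_text, gff_text, feature_type, pattern):
    """Search each annotated region for `pattern`; report 1-based positions."""
    seqs = parse_fasta(fasta_text)
    regex = re.compile(pattern)
    hits = []
    for seqid, start, end in parse_gff(gff_text, feature_type):
        region = seqs[seqid][start - 1:end]  # GFF is 1-based, end-inclusive
        for match in regex.finditer(region):
            hits.append((seqid, start + match.start(), match.group()))
    return hits


# Toy inputs (hypothetical), not real genome data.
FASTA = ">chr1\nACGTAAAAAAGGCC\nTTAAAAAACGT\n"
GFF = (
    "chr1\tdemo\tCDS\t1\t12\t.\t+\t.\tID=cds1\n"
    "chr1\tdemo\tintron\t13\t25\t.\t+\t.\tID=intron1\n"
)
print(find_motif(FASTA, GFF, "intron", "A{6}"))
```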


B47:
Recognizing Gene/Protein Names in Biological Literature using Simple Contextual Features

Subject: Text Mining

Presenting Author: Qiong Wu, Purdue University

Author(s):
Michael Gribskov, Purdue University

Abstract:
Most currently available biomedical named entity recognition (NER) systems take machine learning approaches to train models on complex features collected from specific corpora. However, annotated corpora are not always available for a new domain. Here we explored the possibility of using contextual features and supportive web evidence with Support Vector Machine (SVM) to identify gene/protein names in full-text literature. It has been shown in many NER systems that words surrounding gene/protein names are important in term recognition. We collected 35,384 sentences from 724 full-text articles that have at least one of 1153 pre-identified gene/protein names, and defined a word window of size 5 centered at the gene/protein. 43 contextual features with different word stems were selected from meaningful words that occur most frequently in the 5-word windows. During feature evaluation, instead of using 1 or 0 for the presence of selected contextual words in the neighborhood, we utilize supportive web evidence returned by Yahoo! Search BOSS service. For any candidate term, each value of its contextual features is defined as the ratio of the number of web PDF documents that contain both the candidate term and contextual feature compared to the number containing only the candidate. An SVM model is then trained on each set of such ratio vectors. TF-IDF values of candidate terms are considered to remove false positives. Our system’s performance is comparable to ABNER on unseen texts and achieves an F1-score of 0.496, while requiring only contextual features and allowing simple adaptation to any corpus.
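The ratio-valued features and the F1-score mentioned in this abstract can be sketched as follows. The sketch assumes the denominator is the number of documents containing the candidate term; the function names, the context words, and all hit counts are hypothetical, and no web search service or SVM library is called here.

```python
def context_ratio_vector(candidate_hits, joint_hits):
    """Map each contextual word to joint_hits[word] / candidate_hits.

    candidate_hits: documents containing the candidate term.
    joint_hits: {context_word: documents containing both the candidate
                 term and that context word}.
    """
    if candidate_hits == 0:
        return {word: 0.0 for word in joint_hits}
    return {word: n / candidate_hits for word, n in joint_hits.items()}


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical document counts for one candidate term.
features = context_ratio_vector(200, {"express": 50, "bind": 20})
print(features)
print(f1_score(tp=8, fp=2, fn=2))
```

Vectors of this form, one per candidate term, are what a classifier such as an SVM would then be trained on.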


B48:
Identifying functional groups within the vast antigenic diversity of parasites with high recombination rates: An example from Plasmodium falciparum

Subject: Sequence Analysis

Presenting Author: Mary Rorick, University of Michigan

Author(s):
Edward Baskerville, University of Michigan
Mercedes Pascual, University of Michigan AND Howard Hughes Medical Institute
Donald Chen, New York University School of Medicine
Karen Day, New York University School of Medicine

Abstract:
The genome of Plasmodium falciparum, the primary cause of malaria in humans, contains a multi-copy set of antigenic genes, known as var. These 50-60 var genes are expressed in a mutually exclusive manner, each encoding a variant of the PfEMP1 protein. This protein is trafficked to the surface of infected erythrocytes and it is thought to be the dominant target of the human immune response. PfEMP1 variation is apparently shaped by strong diversifying selection since thousands of var sequence variants often coexist within small local populations. The immense diversity within and between parasite genomes accounts for the remarkable persistence and recurrence of infections within individual hosts. It appears that var gene diversity is achieved through efficient recombination of ancient sequence diversity rather than through high mutation rate. By developing network and model-based methods to analyze var diversity, we aim to identify functional groups of sequences, which we will then use to characterize the structure of sequence diversity at local and global scales. This structure reflects both the functional roles constraining diversity and the mechanisms generating diversity in this important class of antigenic genes. These aspects of var evolvability may have important bearing on intervention strategies aiming at malaria control or eradication.


B49:
Investigating the Prevalence of CRISPRs in Bacterial Genomes

Subject: Sequence Analysis

Presenting Author: Michael Shaffer, Loyola University Chicago

Author(s):
Catherine Putonti, Loyola University Chicago

Abstract:
CRISPR sequences are one way that prokaryotic organisms have developed to defend against bacteriophages. Prokaryotes are able to recognize nucleic acid sequences from invading viruses, incorporate them into their own genomic DNA as spacers in larger sequences, and then use them in the future to defend themselves from attackers with similar sequences. This project proposes to create a three-step algorithm to take in sequence data, search for similar sequences, and then check the functionality of subsequences in viruses associated with CRISPR sequences. The major challenge of this endeavor is to find a balance between performance and accuracy in the search for close matches. Multiple search techniques are examined, each with unique speed and memory benefits. In addition, there is currently no known association between a CRISPR spacer sequence and the function of this sequence in the virus from which it is derived.


B50:
Using Computational and Experimental Tools to Discover New Elements in the Clover Genome

Subject: Sequence Analysis

Presenting Author: Alexander Sbrocchi, Loyola University Chicago

Author(s):
Howard Laten, Loyola University Chicago

Abstract:
Retrotransposons constitute the majority of the protein coding regions of most eukaryotic genomes. Most genomes carry tens to thousands of retrotransposon copies derived from dozens of distinct families, but most, if not all, of these copies are non-functional and contain disabling mutations, including large numbers of indels. Regions rich in these elements have, until recently, been ignored in all but the most complete genome sequencing projects. Many repetitive DNA families, such as those in the genus Trifolium, can be pieced together from hundreds of short overlapping DNA sequence fragments that exist on separate clones that have been deposited in GenBank databases containing BAC-end sequences. The results are hypothetical sequences that encode fully functional elements with intact open reading frames and other conserved features. These sequences provide the basis for determining when, during the history of native and/or synthetic allopolyploid Trifolium, retrotransposon insertion occurred.


B51:
Data Mining Techniques for Interpreting Metagenomic Sequence Data

Subject: Databases & Ontologies

Presenting Author: Gina Kuffel, Loyola University Chicago

Author(s):
Catherine Putonti, Loyola University Chicago

Abstract:
Advances in the amount of data which can be generated by next-generation sequencers provide us with a unique opportunity to assess not only the microbial members present within environmental samples, but also the genes being expressed. One of the current limitations of the new sequencing technology is referred to as the ‘read mapping problem’. Currently available tools work best when there is little variation between the sequencing read and the reference genome to which the reads are being mapped. Considerable variation, however, exists in microbial sequencing projects. Recently, we have developed new software specifically for the mapping of short reads. The value of this whole pipeline is in the analysis stage, on which I will be focusing my efforts. In order to analyze and interpret the vast amount of data delivered by high throughput sequencing, it is necessary to automate the processes using bioinformatics techniques and available software. The Gene Ontology database will be used in conjunction with the Bioperl software package to assign functions to the genes that appear in a metagenomic sample.


B52:
Successful Clinical Application of Diagnostic Whole Genome Sequencing; Tools and methodologies

Subject: Sequence Analysis

Presenting Author: Elizabeth Worthey, The Medical College of Wisconsin

Author(s):
George Kowalski, MCW
Brandon Wilk, MCW
Jeremy Harris, MCW
Weihong Jin, MCW
Bradley Taylor, MCW
Marek Tutaj, MCW
Jeff De Pons, MCW
Mary Shimoyama, MCW
Howard Jacob, MCW

Abstract:
At the Medical College of Wisconsin we began deploying genomic sequencing in the clinic in the latter half of 2009 to end a diagnostic odyssey in a specific child. We continue to offer this service as a clinical tool to end such odysseys for other specific, very ill individuals being seen at the Children’s Hospital and Health System of Wisconsin. This presentation will focus on our analysis strategy and tools, but will also cover our recent findings and a number of lessons learned from performing analysis of whole genome sequences in the clinic.


B53:
Microarray Deconvolution Using Low Rank Approximation

Subject: Gene Regulation & Transcriptomics

Presenting Author: Chao Wang, The Ohio State University

Author(s):
Kun Huang, The Ohio State University
Raghu Machiraju, The Ohio State University

Abstract:
Motivation: High-throughput gene expression profiling plays a significant role in discovering genetic evidence of diseases, especially cancers. However, traditional Significance Analysis of Microarrays (SAM) studies cannot distinguish variance caused by differences in the cell populations of a measured sample from the actual transcript abundance in the patient. Computational deconvolution of microarray data into different cell types is difficult, especially without prior information about the population of cell types. Problem statement: We present a compressive sensing based approach for deconvolving high-throughput gene expression data from heterogeneous tissue samples. Approach: The problem of deconvolving microarray data into multiple cell types is modeled as a three-step approach. First, because the microarrays have to be deconvolved as the product of the proportion matrix and the gene expression for each cell type, low-rank approximation methods are applied to obtain this product. Then, we select candidate cell-type-specific genes from the literature in an unbiased manner and retain the highly relevant ones before we factorize the data. Finally, gene enrichment tests are conducted based on the different cell-type gene expression profiles between groups of samples, so that gene markers can be discovered in each individual cell type. Results: Validation of this method on a public dataset shows its capability of estimating accurate proportions without knowing the individual expression levels in each cell line. Conclusions: The method can be extended to any other microarray dataset to assist the analysis of the development of cancers. A more effective regularization method will improve the computational deconvolution.


B54:
Gene Clustering based on Neighborhood Information

Subject: Algorithm Development and Machine Learning

Presenting Author: Nan Meng, The Ohio State University

Author(s):
Chao Wang, The Ohio State University
Kun Huang, The Ohio State University

Abstract:
Raw gene expression data are too large and complex to interpret directly, hence gene clustering is usually the first step in analyzing gene data. The clustering result is critical for any analysis afterward, and the distance metric is the key to success. It is very common to use expression values as coordinates to project genes into a high dimensional space and apply clustering algorithms based on distances between points. Since the spatial structure of the genes is unclear, some genes that share common features may not end up in the same cluster.

To better describe each gene and classify genes into meaningful groups, we propose using neighborhood information as the similarity metric for gene clustering, which could lead to detection of biological function connections between genes that are not clear in other methods. We use values of Ripley's K function, a widely used spatial statistics technique, to describe the neighborhood information as the feature space, then apply K-means clustering. The results show that certain gene functions were detected only in our cluster results, compared with clustering based only on expression values. Since this method does not assume a particular spatial structure for the data, it detects functions that may be neglected by others.
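One way to turn neighborhood information into a feature vector is sketched below. It is an assumption-laden illustration, not the authors' implementation: each gene is treated as a point, its features are the counts of other points within several radii (an unnormalized, per-point analogue of Ripley's K), and the global estimator shown is a naive form without edge correction. The points, radii, and window area are made up.

```python
import math


def neighborhood_features(points, radii):
    """For each point, count the other points within each radius."""
    features = []
    for i, p in enumerate(points):
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        features.append([sum(1 for d in dists if d <= r) for r in radii])
    return features


def ripley_k(points, radius, area):
    """Naive Ripley's K estimate (no edge correction):
    K(r) = area / (n * (n - 1)) * number of ordered pairs within r.
    """
    n = len(points)
    pair_count = sum(row[0] for row in neighborhood_features(points, [radius]))
    return area * pair_count / (n * (n - 1))


# Hypothetical 2-D coordinates: one group of three points and one of two.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
feats = neighborhood_features(points, radii=[1.5, 20])
print(feats)
print(ripley_k(points, radius=1.5, area=121.0))
```

Rows of `feats` could then be passed to any K-means implementation in place of the raw expression coordinates.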

