Poster Presentations


A01:
A parallel constraint-based algorithm for Bayesian network learning

Subject: Algorithm Development & Machine Learning

Presenting Author: Mingyi Wang, The Samuel Roberts Noble Foundation

Abstract:
Bayesian networks have been used to model gene regulatory networks in bioinformatics. However, the computational complexity of this model is a bottleneck when applying it to large amounts of experimental data from biological systems. To address this problem, we present a parallel algorithm for a constraint-based optimal structure search of Bayesian networks. We take the original PC algorithm as the basis for parallel computing. The iterative pairwise conditional independence (CI) tests in the classic PC algorithm are easily distributed across multiple processors because each pairwise test can be computed independently of all others. Besides parallel computing, two additional modifications were applied to further improve the efficiency and performance of the algorithm. First, only low-order CI tests are carried out in the first phase, and the neighbor numbers of related node pairs are checked in the second phase to guarantee that directionality inference is still correct. Second, the adjacent nodes of all nodes are stored in an array before parallel CI tests of a higher order are carried out, so an edge deletion no longer affects the adjacent nodes used for CI tests within the same level. We ran performance tests on simulated data and on three gold-standard datasets from DREAM5. We demonstrate that the speed-up scales almost linearly with the number of processors. The performance tests showed that our algorithm scales to large datasets with several thousand genes and is promising for identifying causal relationships.
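
The level-wise structure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Gaussian data summarized by a correlation matrix, uses a Fisher z-test on partial correlations as the CI test, and uses a thread pool only to show where the tests are distributed (a real speed-up would need processes or MPI). The names `pc_skeleton`, `check_edge`, `z_crit` and `workers` are hypothetical. The adjacency lists are frozen at the start of each level, so deletions made within a level cannot change the conditioning sets used in that level.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def partial_corr(corr, x, y, cond):
    """Partial correlation of x and y given cond, via the standard recursion."""
    if not cond:
        return corr[x][y]
    z, rest = cond[0], cond[1:]
    r_xy = partial_corr(corr, x, y, rest)
    r_xz = partial_corr(corr, x, z, rest)
    r_yz = partial_corr(corr, y, z, rest)
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denom if denom > 0 else 0.0


def is_independent(corr, n_samples, x, y, cond, z_crit=1.96):
    """Fisher z-test of zero partial correlation (1.96 is roughly alpha = 0.05)."""
    r = max(min(partial_corr(corr, x, y, cond), 0.999999), -0.999999)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n_samples - len(cond) - 3)
    return abs(z) < z_crit


def check_edge(corr, n_samples, x, y, neighbours, order):
    """Return (x, y, separating set) or (x, y, None) for one edge at one level."""
    for a, b in ((x, y), (y, x)):
        candidates = [v for v in neighbours[a] if v != b]
        for cond in combinations(candidates, order):
            if is_independent(corr, n_samples, x, y, cond):
                return x, y, cond
    return x, y, None


def pc_skeleton(corr, n_samples, max_order=2, workers=4):
    """Skeleton phase only: start fully connected, delete edges level by level."""
    n = len(corr)
    adj = {i: set(range(n)) - {i} for i in range(n)}
    sepset = {}
    for order in range(max_order + 1):
        # Freeze the adjacency lists so deletions in this level
        # do not affect the conditioning sets of its other tests.
        frozen = {i: sorted(adj[i]) for i in adj}
        edges = [(x, y) for x in adj for y in adj[x] if x < y]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(
                lambda e: check_edge(corr, n_samples, e[0], e[1], frozen, order),
                edges))
        for x, y, cond in results:
            if cond is not None:
                adj[x].discard(y)
                adj[y].discard(x)
                sepset[(x, y)] = cond
    return adj, sepset
```

Because every test in a level reads only the frozen snapshot, the result does not depend on the order in which workers finish, which is what makes the tests safe to distribute.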


A02:
Probabilistic Inference of Graphs from Observed Diffusion Dynamics

Subject: Algorithm Development & Machine Learning

Presenting Author: Emre Sefer, Carnegie Mellon University

Abstract:
We tackle the problem of inferring graph edges when we can only observe an SEIR diffusion process spreading over the nodes of a graph. This problem is important in the common case where node states can be estimated at lower cost than edges can be found. Common applications include inferring a contact network from disease spread data, or inferring a reference network from idea spreading. We improve upon existing approaches for this problem in four ways: (1) we assume we are provided only with probabilistic information about the state of each node; (2) we assume we may not sample at the same timescale at which the diffusion process is occurring; (3) we present a more general framework that better uses trace data to model edge non-existence; and finally, (4) we extend the method to infer dynamically rewiring networks. No previous work handles the case of under-sampled traces or traces sampled with uncertainty. Experiments on both real and synthetic data show our method is accurate even under these challenging cases.
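
To make the observation setting concrete, the sketch below simulates the forward process whose traces such a method would take as input. It is an illustration under stated assumptions, not the inference method itself: time is discrete, and the per-step probabilities `p_infect`, `p_e_to_i` and `p_i_to_r`, as well as the function names, are hypothetical. `subsample` mimics sampling at a coarser timescale than the one at which the diffusion occurs.

```python
import random


def simulate_seir(edges, seeds, p_infect=0.3, p_e_to_i=0.5, p_i_to_r=0.2,
                  steps=20, rng=None):
    """Discrete-time SEIR process on an undirected graph; returns one state dict per step."""
    rng = rng or random.Random(0)
    nodes = sorted({u for e in edges for u in e})
    nbrs = {v: set() for v in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    state = {v: "S" for v in nodes}
    for s in seeds:
        state[s] = "I"
    trace = [dict(state)]
    for _ in range(steps):
        new = dict(state)
        for v in nodes:
            if state[v] == "S":
                # Each infectious neighbour independently exposes v.
                for u in sorted(nbrs[v]):
                    if state[u] == "I" and rng.random() < p_infect:
                        new[v] = "E"
                        break
            elif state[v] == "E" and rng.random() < p_e_to_i:
                new[v] = "I"
            elif state[v] == "I" and rng.random() < p_i_to_r:
                new[v] = "R"
        state = new
        trace.append(dict(state))
    return trace


def subsample(trace, every):
    """Keep every `every`-th snapshot to mimic under-sampled observations."""
    return trace[::every]
```

With a coarse `every`, a node can appear to jump from S directly to I or R between snapshots, which is the kind of under-sampled trace the abstract refers to.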


A03:
A graph-based optimization system for comparative genomics of bacterial regulatory networks

Subject: Algorithm Development & Machine Learning

Presenting Author: Sefa Kilic, University of Maryland Baltimore County

Author(s):
Ivan Erill, University of Maryland Baltimore County, United States

Abstract:
The lack of experimental data on transcription factor-binding sites and the exponentially increasing amount of sequence data make comparative genomics an attractive tool for the analysis of transcriptional networks in bacteria. The large phylogenetic distances and the variability in binding motif and network composition associated with bacteria, however, have complicated the development of comparative genomics techniques. In this work, we introduce a comparative genomics method to simultaneously identify motifs and map regulons in multiple species using a systematic approach that operates under minimal assumptions on regulon/motif conservation. The evolutionary relationship among species is captured by representing the system as a graph model, where each node corresponds to a species and edges define the phylogenetic relationships between species. Multiple sources of experimental knowledge can be introduced as nodes in the network, and they contribute species-specific information on known motifs and regulons. The fitness of any given graph configuration depends on the observed similarity between individual motifs and regulons, as well as the quality of individual motifs. The gradual improvement of motifs and regulons across multiple species is hence formulated as a generic optimization problem that aims at maximizing the fitness function. We report the benchmarking of the system against available experimental data on three main bacterial transcription factors (LexA, CRP and Fur) across multiple bacterial groups. We analyze the impact of different optimization methods and the inclusion of multiple constraints and sources of information. We also analyze the biological relevance of predicted regulons for these three transcription factors.


A04:
IIID VGAS: a visual genome analysis studio incorporating high performance next generation sequence analysis

Subject: Algorithm Development & Machine Learning

Presenting Author: Shay Leary, Institute for Immunology & Infectious Diseases

Author(s):
Don Cooper, Institute for Immunology & Infectious Diseases, Australia
Abha Chopra, Institute for Immunology & Infectious Diseases, Australia
Mark Watson, Institute for Immunology & Infectious Diseases, Australia
Simon Mallal, Institute for Immunology & Infectious Diseases, Australia

Abstract:
Background
Bioinformaticians have gained extensive knowledge and accumulated diverse sets of skills in the use and understanding of a wide range of genome alignment and analysis tools. Many of these tools have been developed by computer scientists with minimal domain expertise, or by experts in the bioinformatics field lacking strong graphical user interface design skills. This project has been funded by a world class institute and aims to combine ideas from experts in the appropriate fields to develop a solution which combines cutting-edge genomic based algorithms with an enhanced visual user experience.

Results / Conclusions
Based on this expertise, a visual genome analysis studio (IIID VGAS) was specifically developed with the ambition of setting a new benchmark in genomic alignment visualization. Furthermore, VGAS offers an expanding suite of post-alignment tools that present all data visually, allowing for further analysis and interpretation of results. To date, the set of integrated tools includes various scalable viewers (e.g. reference genome and epitope mapping) and high performance sequence aligners (supporting FASTA, Sanger, and Next-Gen sequencing). References can be imported directly from various publicly available repositories (Genbank, Ensembl, IMGT) or loaded from the VGAS core database. The design structure is the result of over 5 years of collaboration between scientists and software engineers.


A05:
Modeling Mechanotransduction Signaling through Actin Filament Network Deformation Linked to Biochemical Response

Subject: Algorithm Development & Machine Learning

Presenting Author: John Kang, Carnegie Mellon University

Author(s):
Kathleen Puskar, Point Park University, United States
Philip LeDuc, Carnegie Mellon University, United States
Russell Schwartz, Carnegie Mellon University, United States

Abstract:
Mechanical forces are increasingly understood to play a crucial role in biological phenomena in fields ranging from stem cell differentiation to cancer metastasis to vasculogenesis. The links between mechanical input, the corresponding morphological changes in the actin cytoskeleton, and the resulting biochemical response are not yet well understood. Resolving them is a significant challenge in the field of mechanotransduction. Here, we present a model that integrates actin filament network remodeling under stretch with a novel biophysical model of molecular release to further elucidate the interplay between actin network morphology and resultant biochemical signaling.

As stretch is applied to our model of the discrete actin filament network, the distribution of individual bond angles in the network transitions from a more peaked to a flatter distribution. We used our approach to explore various angle thresholding models of how actin filament network deformations might influence rates of release of bound signaling molecules. These models allow us to project how a biochemical response might appear from a given applied mechanical stimulus. We validate these simulations using published experimental data and use our model to then test the predictive capabilities of different hypotheses for how mechanotransduction may function. Our model for the mechanotransductive release of signaling factors represents a potentially versatile mechanistic platform for examining biophysical interactions that link mechanical stimulus at the cellular level to response at the protein level.


A06:
Automatic Reconstruction of Cellular Metabolic Networks by Combination of Probabilistic Graphical Models and Knowledge-based Methods

Subject: Biological Networks

Presenting Author: Qi Qi, University of Missouri-Columbia

Abstract:
Automatic reconstruction of cellular metabolic networks has been a challenging problem in computational biology for many years. Traditionally, organism-specific metabolic pathways are generated from reference pathways in authoritative databases, e.g. KEGG, via genome annotation. However, this mapping-based method depends heavily on the accuracy of genome annotation and cannot predict unknown interactions. The reconstructed networks also tend to have gaps caused by gene products missing from the genome annotation of the target organism, particularly for unknown species. In contrast, computational methods can predict networks of relationships among genes or gene products by reverse engineering from data, but they are less reliable than the reference mapping-based method. The motivation of this work was to create a system for automatic reconstruction of metabolic networks that combines the merits of these two types of methods. Specifically, we built a knowledge base of gene product interactions and metabolic reactions from existing reference pathway networks. This knowledge then serves as constraints for Bayesian network structure learning methods that predict metabolic pathways.

We tested the approach by predicting a set of 62 yeast metabolic networks in the KEGG database and compared it against the mapping-based method. The comparison between predicted and ground-truth pathways was based on their underlying gene product relationships. The results favored the computational, knowledge-driven approach.


A07:
Comparative analysis of yeast interactome versus human tissue-specific networks

Subject: Biological Networks

Presenting Author: Shahin Mohammadi, Purdue University

Author(s):
Baharak Saberidokht, Purdue University, United States
Ananth Grama, Purdue University, United States

Abstract:
Budding yeast, Saccharomyces cerevisiae, has been used extensively as a model organism for cellular processes in evolutionarily distant species, including humans. Different tissues in humans, while inheriting a similar genetic code, exhibit unique characteristics and functions. This raises the natural question of the extent to which yeast provides a reliable model for different human tissues. To address this question, we construct the first network of tissue-tissue similarities using tissue-specific gene expression signatures. We show that this network clusters functionally related tissues together. We use this network to faithfully reconstruct the differentiation tree of immune cells from hematopoietic stem cells. Next, we perform a comparative network analysis of human tissue-specific networks and the yeast interactome. Using a rigorous statistical model, we identify the subset of tissues that manifest significant similarity to yeast, along with a complementary subset that does not. Finally, we show that tissues that are close in the tissue-tissue similarity network are consistent in terms of their similarity to yeast. Our results provide a valuable basis for projecting hypotheses from yeast to humans. Specifically, they provide a reliable subset of tissues amenable to cross-species generalizations.


A08:
MetDisease: An Annotation of Metabolites to Diseases

Subject: Biological Networks

Presenting Author: Pervis Fly, University of Michigan

Author(s):
Terry Weymouth, University of Michigan, United States
Alla Karnovsky, University of Michigan, United States

Abstract:
Metabolomics as an emerging field allows for the large scale analysis of metabolites in biofluids and tissues as well as characterizing metabolic differences between health and disease. Metabolites can serve as potential markers for a disease. The majority of existing biological annotations for endogenous metabolites come from pathway databases. However, the pathway databases include only about half of metabolites that can be experimentally detected in biological samples. In our previously developed tool Metab2MeSH we used Medical Subject Headings (MeSH) to connect metabolites to functional information in the published literature (http://metab2mesh.ncibi.org/). Our new tool, MetDisease, focuses on disease MeSH keywords, and is implemented as a plugin for the network analysis and visualization tool Cytoscape. The Metab2MeSH annotation only allows for searching one metabolite at a time; MetDisease allows multiple metabolites to be visually inspected at once for association with multiple diseases. Currently, MetDisease can be used to explore the metabolites initially annotated by Metab2MeSH to diseases. Two types of network graphs can be constructed in this plugin: 1) diseases, represented by nodes, are connected by the metabolites shared among them; 2) metabolites, represented by nodes, are connected by the diseases shared among them. In the future, users will be able to build metabolic networks and annotate them with MeSH disease terms.
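
The two graph types described above are the two one-mode projections of a metabolite-disease annotation table. The sketch below shows one way to build them in plain Python; it is not the plugin's code, the function name `project` is hypothetical, and the identifiers `met_A`, `dis_X` and so on are placeholders rather than real annotations.

```python
from collections import defaultdict
from itertools import combinations


def project(pairs, node_pos):
    """Project (metabolite, disease) pairs onto one side.

    node_pos=1 gives disease nodes linked by shared metabolites;
    node_pos=0 gives metabolite nodes linked by shared diseases.
    Returns {(node_a, node_b): set of shared linking items}.
    """
    link_pos = 1 - node_pos
    members = defaultdict(set)  # linking item -> nodes annotated with it
    for pair in pairs:
        members[pair[link_pos]].add(pair[node_pos])
    edges = defaultdict(set)
    for link, nodes in members.items():
        for a, b in combinations(sorted(nodes), 2):
            edges[(a, b)].add(link)
    return dict(edges)


# Placeholder annotations, for illustration only.
annotations = [
    ("met_A", "dis_X"),
    ("met_A", "dis_Y"),
    ("met_B", "dis_Y"),
    ("met_B", "dis_Z"),
]
disease_graph = project(annotations, node_pos=1)
metabolite_graph = project(annotations, node_pos=0)
```

Keeping the set of shared items on each edge, rather than only a count, lets a viewer show which metabolites (or diseases) are responsible for a given connection.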


A09:
Constructing Protein Affinity Networks: A Game-Theoretic Approach

Subject: Biological Networks

Presenting Author: Brittney Hinds, University of Nebraska-Lincoln

Author(s):
Bo Deng, University of Nebraska-Lincoln, United States
Etsuko Moriyama, University of Nebraska-Lincoln, United States

Abstract:
A protein domain is a part of a protein sequence that can function and evolve independently of the rest of the protein. Proteins often include multiple domains. Within a multidomain protein family, domain organization (the composition and order of domains) often varies significantly among member proteins. With dynamic evolutionary events such as domain shuffling, protein evolution is reticulate. As a result, such reticulated proteins need to be represented in a network. To construct such conceptualized protein networks in terms of their domain architectures, we present a game-theoretic model in which the key assumption is that evolution seeks to maximize two competing objectives: to preserve the shared similarities of member proteins and, simultaneously, to diminish such affinities through domain diversification.


A10:
Toward Mutation-Based Predictors of Patient Outcomes in Myeloid Malignancies

Subject: Clinical Informatics & Epidemiology

Presenting Author: Matthew Ruffalo, Case Western Reserve University

Author(s):
Thomas LaFramboise, Case Western Reserve University, United States
Mehmet Koyuturk, Case Western Reserve University, United States
Jaroslaw Maciejewski, Translational Hematology & Oncology Research, United States
Holleh Husseinzadeh, Translational Hematology & Oncology Research, United States
Hideki Makishima, Translational Hematology & Oncology Research, United States

Abstract:
Human myeloid malignancies comprise a highly heterogeneous class of diseases whose pathogenesis is poorly understood and whose current classification is often not concordant with prognosis. Included among myeloid malignancies are myelodysplastic syndromes (MDS) and myelodysplastic/myeloproliferative neoplasms (MDS/MPN), both of which frequently transform to acute myeloid leukemia (AML). Efforts to improve accuracy in the classification of myeloid malignancies have evolved and grown over the years. However, none of the classification schemes incorporate somatic mutations, despite several studies showing associations between the presence of mutations in specific genes and clinical features. Advances in sequencing technologies now make it possible to detect protein-altering somatic mutations in a near-comprehensive manner. In AML and MDS, large-scale sequencing efforts have uncovered a number of recurrently mutated genes. However, a single AML genome typically carries some 200-800 somatic mutations, and with this deluge of data comes a huge preponderance of "passenger" mutations that need to be filtered out in order to pinpoint the cancer-relevant "driver" mutations. The task of sifting through mutations and uncovering clinical associations is well suited to statistical classification methods that have been developed over the years. In the current study, we analyzed whole-exome data from 398 patients with myeloid malignancies, including those classified as MDS (N = 104), MDS/MPN (N = 47), and AML (N = 229) under current diagnostic systems. Our hypothesis was that systematic querying of mutational status across 357 genes can produce a more accurate predictor of survival than current standards.


A11:
Function theoretical analysis of high content data

Subject: Bioimage Analysis

Presenting Author: Timothy Lezon, University of Pittsburgh

Abstract:
Cellular response to external stimuli encodes information on the network of chemical interactions that govern cellular behavior. This response can be measured by fluorescently labeling a subset of cellular biomolecules and imaging the cells in what is called a high content assay (HCA). Automated analysis of HCA data can identify distinct cellular phenotypes, defined by the levels and localization of the measured biomarkers. One challenge in analysis of HCA data is extracting information on the biochemical differences between cells in distinct phenotypes from the heterogeneous response of a population to external stimuli. Here I introduce a function theoretical method for describing the phenotype distributions in cellular populations and demonstrate its utility in identifying biochemical differences between cell lines using HCA data. The underlying idea, that cellular population distributions can be described as sums of orthogonal functions, gives rise to a natural metric for quantifying the difference between cellular populations. As it uses unprocessed HCA data, this technique has a straightforward biological interpretation and can be applied to any HCA dataset.
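
One way to see why an orthogonal expansion yields a natural metric is sketched below. It assumes an orthonormal basis and the squared-difference (L2) distance; the abstract does not specify either, so the symbols p, q, a_k, b_k and \phi_k are illustrative choices.

```latex
p(x) = \sum_{k} a_k \,\phi_k(x), \qquad
q(x) = \sum_{k} b_k \,\phi_k(x), \qquad
\int \phi_j(x)\,\phi_k(x)\,dx = \delta_{jk}

d(p,q)^2 = \int \bigl(p(x) - q(x)\bigr)^2 \,dx
         = \sum_{j,k} (a_j - b_j)(a_k - b_k) \int \phi_j(x)\,\phi_k(x)\,dx
         = \sum_{k} (a_k - b_k)^2
```

Under these assumptions the distance between two population distributions reduces to the Euclidean distance between their coefficient vectors.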


A12:
Efficient Modeling and Simulation of Bacteria Networks with BNSim

Subject: Algorithm Development & Machine Learning

Presenting Author: Guopeng Wei, Carnegie Mellon University

Abstract:
Bacteria networks refer to nanoscale biological networks interconnected using the native or engineered machineries of bacteria (e.g. chemoreceptors, photoreceptors, flagellar motors). This definition includes microscale transportation systems that use chemotactic bacteria to deliver molecular cargo to targets, such as a drug delivery system. At the same time, due to recent advances in synthetic biology, bacteria networks have been extended to incorporate the transcription network inside bacteria, namely the modular genetic circuits for intercellular interactions such as quorum sensing or light communication. To characterize the dynamics of bacterial networks accurately, we introduce BNSim, an open-source, parallel, stochastic, and multiscale modeling platform which integrates various simulation algorithms, genetic circuits and chemotactic pathway models, together with a 3D complex environment. Moreover, we show how this platform can be used to model synthetic bacteria consortia which implement an XOR function and assemble nearby bacteria using light communication. The results obtained with our platform show its ability to predict various properties of realistic bacterial networks and provide guidance for their actual wet-lab implementations.


A13:
An agent-based model of mouse spermatogenic cycles

Subject: Algorithm Development & Machine Learning

Presenting Author: Debjit Ray, Washington State University

Abstract:
The spermatogenic cycle describes the periodic development of male germ cells, which occurs every 8.6 days in the mouse. Each cycle is divided into Stages I to XII based on well-defined associations of germ cells in the seminiferous tubule. The periodic patterning of germ cells results from cellular events including differentiation, proliferation, apoptosis, and movement. However, the precise action of germ cells that leads to the emergence of different cellular patterns remains undefined. We develop an agent-based model (ABM) to simulate the mouse spermatogenic cycle. ABM is a suitable approach for studying dynamic systems in which individual heterogeneity and spatial interactions are important. The model depicts a tubule cross-section in a regular grid. Ten types of germ cells are included, ranging from spermatogonia to spermatids. Kinetic parameters for differentiation, proliferation, apoptosis, and movement are estimated from static and dynamic imaging and irradiation experiments. The dynamic global pattern of germ cell organization on a cross-section emerges from the local, individual cellular behaviors. By manipulating cellular events either individually or collectively in silico, the model allows us to predict the events leading to the abnormal morphology observed in various genetic and environmental perturbations. In summary, our model elaborates the temporal-spatial dynamics of germ cells, allowing us to trace individual cells as they change state and location. More importantly, the model provides a mechanistic understanding of how tissue morphology and sperm production are achieved. Our study may open new possibilities for manipulating cellular behaviors and interactions to restore the continual production of sperm.


A14:
Machine Learning Classification of Non-Small Cell Lung Cancer Subtypes from Gene Methylation Data

Subject: Algorithm Development & Machine Learning

Presenting Author: Arturo Lopez Pineda, University of Pittsburgh

Author(s):
Shyam Visweswaran, University of Pittsburgh, United States
Gregory F. Cooper, University of Pittsburgh, United States
Vanathi Gopalakrishnan, University of Pittsburgh, United States

Abstract:
Lung cancer is the leading cause of human cancer death in the US. Despite extensive research, the mechanisms that lead to different types of lung cancer remain uncertain. Adenocarcinoma and Squamous Cell Carcinoma are the most common forms of non-small cell lung cancer in the US. The DNA methylation status of CpG islands is crucial to understanding the epigenetic regulation of genes. We selected 311 lung cancer samples (150 Adeno, 161 Squamous) from The Cancer Genome Atlas. All samples were originally processed and normalized using the Illumina Infinium Human Methylation 27 BeadChip platform. With these data we performed a machine learning classification analysis. First, we used two different methods for discretizing the data: Fayyad and Irani’s MDLPC and Efficient Bayesian Discretization (EBD). Then we performed a stratified 5-fold internal cross-validation classification using a Naïve Bayes Classifier (NB), Efficient Bayesian Multivariate Classifier (EBMC), and Bayesian Rule Learner (BRL). Our evaluation metric was the area under the receiver operating characteristic curve (AUROC). The average AUROCs are very high (over 90%), which gives us a good starting point for future investigation of the impact of these strategies on the classification of lung cancer subtypes using other datasets.
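
The evaluation protocol above rests on two generic pieces, stratified fold assignment and AUROC, which can be sketched in plain Python. This is an illustration only: the function names and the round-robin fold assignment are assumptions, and the classifiers themselves (NB, EBMC, BRL) are not reproduced here.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        # Deal each class out round-robin so every fold gets its share.
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half. Labels are 1 (positive) or 0 (negative).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With 150 samples of one class and 161 of the other, each of the five folds receives 30 samples of the first class and 32 or 33 of the second, so the class ratio in every test fold stays close to that of the full dataset.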


A15:
HMMvar: Predicting the Functional Effects of Indels and SNPs Based on HMM Profiles

Subject: Algorithm Development & Machine Learning

Presenting Author: Liqing Zhang, Virginia Tech

Author(s):
Mingming Liu, Virginia Tech, United States
Layne T Watson, Virginia Tech, United States

Abstract:
With the development of sequencing technologies, more and more sequence variants are available for investigation. Different classes of variants in the human genome have been identified, including single nucleotide substitutions, insertions and deletions, and large structural variations such as duplications and deletions. Insertion and deletion (indel) variants comprise a major proportion of human genetic variation. However, little is known about their effects on humans. The absence of understanding is largely due to the lack of both biological data and computational resources. This paper presents a new indel functional prediction method, HMMvar, based on HMM profiles, which capture the conservation information in sequences. The results demonstrate that a scoring strategy based on HMM profiles can achieve good performance in identifying deleterious or neutral variants for different data sets, and can predict the protein functional effects of both single and multiple mutations.


A16:
An Accurate Matlab Based Tool for Inverted Repeat Detection

Subject: Algorithm Development & Machine Learning

Presenting Author: Sreeskandarajan Sutharzan, Miami University

Author(s):
Michelle Flowers, Miami University, United States
John Karro, Miami University, United States
Chun Liang, Department of Botany, United States

Abstract:
DNA palindromes are DNA sequences that are identical to their own reverse complement, a characteristic which can lead to DNA secondary structures such as hairpins, bulges and cruciforms; these structures are involved in many biological reactions. For example: the formation of the palindrome-based hairpin structure plays a role in transposition reactions; the DNA cruciform structure may be involved in the initiation of DNA replication through its ability to change the supercoiling level; and DNA palindromes can act as the binding sites of regulatory factors of dimeric nature. Inverted Repeats (IRs) are considered to be perfect palindromes, and detecting them can facilitate our understanding of the many biological processes that involve them. The ability to identify palindromic sequences is important in the study of such reactions, but many of the available tools, including the Matlab bioinformatics toolbox’s palindromes function, do not perform well in detecting nested and overlapping IRs. We propose a new computational method, implemented as a Matlab-based tool, for IR detection at increased sensitivity. The algorithm scores each nucleotide position in a given DNA sequence using a prime number scoring system, relying on the periodicity of relatively prime numbers to identify the presence of IRs in specific regions. Potential IR sequences are detected based on the score distribution, and the valid inverted repeats are found using a filtering step. The proposed tool can aid in genome-wide analyses of nested and overlapping IRs, ultimately giving the community more power to understand the role of these repeats in genomic processes.
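
For reference, the definition above (a sequence identical to its own reverse complement) can be checked directly by exhaustive search. The sketch below is a brute-force baseline in Python, not the prime-number scoring method and not the Matlab tool; the function names and the `min_len` and `max_len` parameters are hypothetical. It reports nested and overlapping hits because every start position and length is tested independently.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def find_palindromes(seq, min_len=4, max_len=None):
    """Return (start, length, subsequence) for every perfect DNA palindrome.

    Only even lengths are tested: in an odd-length window the middle base
    would have to be its own complement, which no base is.
    """
    seq = seq.upper()
    max_len = max_len or len(seq)
    first = min_len + (min_len % 2)  # round an odd minimum up to even
    hits = []
    for start in range(len(seq)):
        longest = min(max_len, len(seq) - start)
        for length in range(first, longest + 1, 2):
            sub = seq[start:start + length]
            if sub == reverse_complement(sub):
                hits.append((start, length, sub))
    return hits
```

On the EcoRI recognition site `GAATTC`, this reports both the full six-base palindrome and the nested `AATT`. The search is quadratic in sequence length when `max_len` is unbounded, so it serves as a correctness reference rather than a genome-scale tool.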


A17:
A Novel Rule-based Method for Detecting Differential Splicing Events Using RNA-Seq

Subject: Algorithm Development & Machine Learning

Presenting Author: Nan Deng, Wayne State University

Author(s):
Dongxiao Zhu, Wayne State University, United States

Abstract:
It is reported that more than 90% of human genes are alternatively spliced via different mechanisms. Increasingly available RNA-Seq technology provides unprecedented opportunities for detecting differential pre-mRNA alternative splicing between different transcriptomes. In addition to differential expression analysis, differential splicing analysis may yield new understanding of cell development and differentiation. We present a new computational method for detecting differential alternative splicing events (DASEs) between transcriptomes using RNA-Seq data. Our method detects significant DASEs between transcriptomes in the form of alternative splicing modules (ASMs) using a parametric statistical test and a rule-based multiple change-point analysis on each ASM. Our method detects both differential splicing and the splicing mechanisms. We applied our method to detect DASEs on a public RNA-Seq dataset of H1 and H1 differentiated into neural progenitor cell lines. We detected many significant DASEs falling into the five well-known alternative splicing mechanisms. The results demonstrate that our method is a promising approach for detecting DASEs from RNA-Seq data.


A18:
Coalescent-based Estimation of Population History in the Presence of Admixture from Large Genetic Variation Data

Subject: Algorithm Development & Machine Learning

Presenting Author: Ming-Chi Tsai, Carnegie Mellon University

Author(s):
Guy Blelloch, Carnegie Mellon University, United States
R. Ravi, Carnegie Mellon University, United States
Russell Schwartz, Carnegie Mellon University, United States

Abstract:
Understanding how modern human populations have arisen, dispersed, and intermixed since we emerged as a species is an important but challenging problem. Inferring how the human population is structured and how different subgroups are related to one another is not only a fundamental issue in population genetics, but also has practical relevance to improving models of genome evolution. Despite considerable attention to the general problem of identifying population substructure in large-scale variation data, the field lacks automated methods for reconstructing the relationships among population subgroups and inferring the correct order and timing of events in the presence of admixture. We describe here a novel two-step approach for inference of quantitative population history in the presence of admixture from large variation datasets. The method first identifies a set of phylogenetic splits that are likely to have occurred during the evolutionary history and assigns each observed variation site to one of the splits. The resulting split set and the number of variation sites assigned to each split are then used to infer a model of population-level evolution describing the times of the divergence and admixture events as well as admixture proportions using a coalescent-based Markov chain. Evaluation on simulated three- and four-population data sets suggests that fully automated reconstruction of population histories in the presence of admixture is feasible, although further algorithmic improvements may be needed to infer more complicated scenarios.


A19:
A probabilistic method for RNA-Seq read error correction

Subject: Algorithm Development & Machine Learning

Presenting Author: Marcel Schulz, Carnegie Mellon University

Author(s):
Hai-son Le, Carnegie Mellon University, United States
Brenna McCauley, Carnegie Mellon University, United States
Veronica Hinman, Carnegie Mellon University, United States
Ziv Bar-Joseph, Carnegie Mellon University, United States

Abstract:
Sequencing of RNAs with next generation sequencing technologies (RNA-Seq) has revolutionized the field of transcriptomics for genetics and medical research. RNA-Seq experiments are routinely applied to study mRNAs, miRNAs, and other short RNAs in a diverse range of organisms. Error correction of RNA-Seq data is an important research direction to improve data analysis. Specifically, error-sensitive analyses such as de novo transcriptome assembly and detection of RNA editing events may benefit from sequencing error correction. Existing methods are ad-hoc or originally developed for genomic sequencing data.
In this work, we devise the first general method to remove sequencing errors from RNA-Seq reads. Removal of sequencing errors in RNA-Seq data is challenging because of the overlapping effects of non-uniform abundance, polymorphisms, and alternative splicing (mRNAs). We present the SEECER algorithm, based on a formulation of probabilistic profile Hidden Markov Models, that addresses all the above-mentioned challenges. We show that SEECER reduces the number of sequencing errors, significantly increases the performance of downstream analyses with or without an available reference sequence, and vastly outperforms ad-hoc approaches that researchers currently use.


A20:
Efficient Stochastic Simulation of Chemical Kinetics Networks Using A Weighted Ensemble Of Trajectories

Subject: Biological Networks

Presenting Author: Rory Donovan, University of Pittsburgh

Author(s):
Andrew Sedgewick, University of Pittsburgh, United States
James Faeder, University of Pittsburgh, United States
Daniel Zuckerman, University of Pittsburgh, United States

Abstract:
We apply the "weighted ensemble" (WE) simulation strategy, previously employed in the context of molecular dynamics simulations, to a series of systems-biology models that range in complexity from a one-dimensional system to a system with 354 species and 3680 reactions. WE is relatively easy to implement, does not require extensive hand-tuning of parameters, does not depend on the details of the simulation algorithm, and can facilitate the simulation of extremely rare events.

For the coupled stochastic reaction systems we study, WE is able to produce accurate and efficient approximations of the joint probability distribution for all chemical species for all time t. WE is also able to efficiently extract mean first passage times for the systems, via the construction of a steady-state condition with feedback. In all cases studied here, WE results agree with independent calculations but significantly enhance the precision with which rare or slow processes can be characterized. Speedups over "brute-force" sampling of rare events with the Gillespie direct Stochastic Simulation Algorithm range from ~10^12 to ~10^20 for rare states in a distribution, and from ~10^2 to ~10^4 for finding mean first passage times.
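
For readers unfamiliar with the brute-force baseline, the following is a minimal sketch of the Gillespie direct method for a single-species birth-death process. It is not the WE procedure and not the authors' code; the model, the rate names `k_birth` and `k_death`, and the function name are illustrative assumptions.

```python
import random


def gillespie_birth_death(k_birth, k_death, x0, t_end, rng):
    """Simulate one trajectory of 0 -> X (rate k_birth) and X -> 0 (rate k_death * x).

    Hypothetical toy model used only to illustrate the direct method.
    Returns the reaction times and the copy number after each event.
    """
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_birth = k_birth          # propensity of the birth reaction
        a_death = k_death * x      # propensity of the death reaction
        a_total = a_birth + a_death
        if a_total == 0.0:
            break                  # no reaction can fire any more
        # Waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Choose which reaction fires, with probability proportional to its propensity.
        if rng.random() * a_total < a_birth:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


if __name__ == "__main__":
    times, counts = gillespie_birth_death(10.0, 1.0, 0, 50.0, random.Random(0))
    print(len(times), "events; final copy number:", counts[-1])
```

States far from the mean of such a process are visited very rarely by trajectories like this one, which is the sampling problem that motivates an ensemble strategy.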


A21:
Improving inference of rate parameters for viral capsid assembly systems

Subject: Algorithm Development & Machine Learning

Presenting Author: Lu Xie, Carnegie Mellon University

Author(s):
Gregory Smith, Carnegie Mellon University, United States
Russell Schwartz, Carnegie Mellon University, United States

Abstract:
Viral capsid assembly is a key model for complex self-assembly for which many important features, such as site-specific binding rates and detailed pathway information, remain elusive. Simulation-based data fitting algorithms can allow us to infer such features, but significant computational challenges make it difficult to construct precise models and quantify uncertainty in the inferences. First, there is no closed-form representation for the quality of fit of models to data, which therefore must be evaluated through computationally costly simulations. Second, the problem requires stochastic simulations, and the resulting simulation trajectories must be averaged over many replicates to suppress noise. Third, optimization of parameters must account for unknown factors in models and imprecision in experimental measurements. We have applied a heuristic optimization approach involving gradient descent and response surface fitting to fit quantitative parameters to the light scattering measurements of three in vitro viral assembly systems: human papillomavirus (HPV), hepatitis B virus (HBV), and cowpea chlorotic mottle virus (CCMV). This method has identified kinetic parameters for the three viruses that yield good fits to available data on bulk in vitro assembly. We have further applied derivative-free optimization (DFO) algorithms to address the computational challenges of fitting parameters to noisy data from costly stochastic simulations. Here we examine the model fits and their implied pathways for the heuristic algorithm, as well as preliminary applications of two DFO methods: stable noisy optimization by branch and fit (SNOBFIT) and multi-coordinate search (MCS).


A22:
Across the bridges and into the trees

Subject: Biological Networks

Presenting Author: Sandra Smieszek, Royal Holloway

Abstract:
Circadian clocks are ubiquitous and are found in bacteria, fungi, plants and animals. They phase cellular processes and behavior to specific times of day and anticipate daily changes, providing a fitness advantage. The difficulty in elucidating the circadian clock is rooted in the notion that two independent factors contribute to controllability: the system's architecture (which components interact with each other) and the dynamic rules that capture the time-dependent interactions between components. The identification of the principal actors (several have been characterized) inside a molecular and phenotypic network or inside a community is important for understanding the topology of a complex network. Previously, we developed novel methods for the identification of circadian genes from short time-course microarray data and for the identification of the individual regulatory motifs which aggregate into coherent motif clusters capable of predicting the phase of a clock gene with high fidelity. Such motifs form the backbone of regulatory networks and play a central role in defining their global topological organization. Here, we integrate gene expression profiles and protein interaction maps to provide a systematic and global view of combinatorial network modules underlying representative circadian programs. Furthermore, we integrate the newly discovered cis-regulatory modules into the circadian regulatory networks. This study forms the beginning of an analytical framework that should allow one to study the controllability of a complex system like the circadian clock in plants through the combination of driver nodes with their time-dependent control reflecting the system's dynamic logic. Such a network will provide a quantitative yet holistic outlook.


A23:
Genetic Diversity of the Milfoil Weevil, a native biocontrol agent for Eurasian Watermilfoil

Subject: Biological Networks

Presenting Author: Lara Roketenetz, University of Akron

Abstract:
Genetic data (COI mtDNA) for the native milfoil weevil, Euhrychiopsis lecontei, have been examined to determine the presence and extent of any geographically structured genetic patterns. This information is lacking in the literature and is essential to the continued utilization of this weevil as a biocontrol agent for Eurasian watermilfoil (Myriophyllum spicatum), an invasive aquatic macrophyte. Through my research, 36 haplotypes (~850 base pairs of COI) have been identified from across the northern United States and Ontario, Canada. These data were examined in MEGA (Molecular Evolutionary Genetics Analysis) to construct a phylogenetic tree using Maximum Likelihood and in Network 4.6.1 to construct a haplotype network. Haplotype networks are used to graphically depict genetic diversity within a group. Interestingly, preliminary results show there may be two distinct groupings of E. lecontei: a Great Lakes watershed group and a Mississippi watershed group. Nested Clade Phylogeographic Analysis (NCPA) (Templeton et al., 1995), although controversial (Petit, 2008), can be used prior to performing analysis with GIS (Geographic Information Systems) data to shed light on whether or not geographical associations of the haplotypes exist and, if so, whether these associations are due to restrictions in gene flow or stem from historical events.


A24:
Evaluating correlation between genomic annotations and spatial closeness

Subject: Biological Networks

Presenting Author: Hao Wang, Carnegie Mellon University

Author(s):
Geet Duggal, Carnegie Mellon University, United States
Michelle Girvan, University of Maryland, United States
Sridhar Hannenhalli, University of Maryland, United States
Carl Kingsford, Carnegie Mellon University, United States

Abstract:
Recent chromosome conformation capture (3C) experiments result in a network of genome-wide chromatin interactions. We propose a variety of new statistics based on the topology of the interaction graph for testing whether a set of genomic loci, or a subset of it, is likely to be spatially close in three dimensions. We also introduce a framework for systematically evaluating the methods' ability to accurately assess the spatial enrichment of a set of genomic loci. To construct spatial enrichment test cases, we search for different sizes of spatially compact cores within the real 3D embeddings of Saccharomyces cerevisiae, and we define true positive and true negative cases based on the ratio between the compact core size and the overall set size. We show that all tested methods perform well in the spatial enrichment test. We also show that the method based on finding the maximum density subgraph can extract dense subgraphs that overlap well with the real compact cores of a given set. Further, the tested topological properties inside the dense core correlate well with the true spatial proximities. Among yeast features, we observe that the telomeres, which previous tests identified as spatially compact, are likely not. Conversely, we identified a set of previously studied breakpoint features that has a large compact core that the edge-fraction method fails to identify. Taken together, these results show that considering alternative topological features of 3C graphs can substantially improve our ability to correctly detect spatial functional enrichment.
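
To make the idea of a dense core in an interaction graph concrete, here is a minimal sketch of greedy peeling, a well-known approximation for the densest-subgraph problem (density taken as edges per node). It is a swapped-in, simpler technique, not necessarily the exact maximum-density method used in this work; the function name and the toy graph are assumptions for illustration.

```python
def densest_subgraph_greedy(edges):
    """Greedy peeling: repeatedly drop a minimum-degree node and remember
    the intermediate node set with the highest edges-per-node density."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    nodes = set(adj)
    n_edges = sum(len(nb) for nb in adj.values()) // 2
    best_density, best_nodes = 0.0, set(nodes)

    while nodes:
        density = n_edges / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
        # Remove one node of minimum degree (ties broken by name for determinism).
        u = min(nodes, key=lambda n: (len(adj[n]), str(n)))
        for v in adj[u]:
            adj[v].discard(u)
        n_edges -= len(adj[u])
        nodes.remove(u)
        del adj[u]

    return best_nodes, best_density


if __name__ == "__main__":
    # Hypothetical loci: a 4-clique (a, b, c, d) with a sparse tail (e, f).
    toy = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
           ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f")]
    print(densest_subgraph_greedy(toy))
```

On the toy graph the peeling discards the sparse tail first and reports the clique as the dense core.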


A25:
Conceptualization of molecular findings by mining gene annotations

Subject: Databases & Ontologies

Presenting Author: Vicky Chen, University of Pittsburgh

Abstract:


A26:
Five Compartment Microbial Fuel Cell Model

Subject: Chemical Biology

Presenting Author: Joseph Gaone II, The University of Akron

Abstract:
Microbial fuel cells (MFCs) have attracted some attention in recent years as a possible source of clean energy. In this study a five-chamber MFC computational model is developed. The primary goal is to test the efficiency of such a device. Microbes chosen for this MFC produce a nanowire matrix with conductivity that varies with pH level and biomass. Variable conductivity is a new contribution to the microbial fuel cell field, as is the inclusion of all five chambers in one model. These contributions are examined to understand their effects on the efficiency and optimization of this MFC. Biomass equations, chemical equations, and potential equations are developed to model the operation of each compartment. Numerical methods are used to solve the various sets of ordinary and partial differential equations. This model will provide greater insight into how an MFC operates with this particular type of microbe. This is a promising configuration that may one day lead to an efficient clean energy resource.


A27:
Feature Reduction using Gene Ontologies

Subject: Databases & Ontologies

Presenting Author: Anthony Deeter, University of Akron

Abstract:
Current feature selection methods use statistics (t-test, linear regression) to infer which features should be selected for use within a classifier. Statistical p-value corrections, from the most stringent (Bonferroni) to the more moderate (Benjamini-Hochberg), make assumptions that the tests for gene expression level significance are not completely independent. They attempt to adjust for any dependency between these tests, but they fail to adjust for dependencies between the genes themselves. This can be accomplished by utilizing biological data we have gathered about the organisms we are trying to classify. The method of feature selection I propose utilizes both standard statistical methods and biological data found within the Gene Ontology Project in order to better understand how features interact with each other. This allows a better calculation of statistical significance with regard to feature selection and classification while using a smaller set of data. Using three sets of human lung gene expression data (one healthy and two different types of lung cancer), I will use Analysis of Variance to select a preliminary set of statistically significant genes to be considered for feature selection. I will use Bonferroni, Benjamini-Hochberg, and each of those two combined with my own method incorporating Gene Ontology groups to refine this list into four feature sets. I will classify the three expression data sets using multiclass versions of Decision Tree, Support Vector Machine, and Nearest Neighbor classifiers and test each of the 12 total classifiers using 10-fold cross validation to compute confidence intervals.
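
As a point of reference for the two standard corrections named above, the following sketch computes Bonferroni and Benjamini-Hochberg adjusted p-values in plain Python. It covers only the textbook corrections, not the proposed Gene Ontology method; the function names and the example p-values are illustrative assumptions.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjustment (false discovery rate control).

    The p-value of rank r (1 = smallest) is scaled by m / r, then a running
    minimum from the largest rank downward enforces monotonicity.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    p = [0.01, 0.04, 0.03, 0.005]  # made-up per-gene p-values
    print("Bonferroni:        ", bonferroni(p))
    print("Benjamini-Hochberg:", benjamini_hochberg(p))
```

Every Benjamini-Hochberg adjusted value is at most the corresponding Bonferroni value, which is why Bonferroni retains fewer features at a given threshold.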


A28:
Uncovering the genetic etiology responsible for non-syndromic oligodontia in a Caucasian family

Subject: Clinical Informatics & Epidemiology

Presenting Author: Yongsheng Bai, University of Michigan

Author(s):
Shih-Kai Wang, University of Michigan, United States
Jianyi Yang, University of Michigan, United States
Yang Zhang, University of Michigan, United States
James Simmer, University of Michigan, United States
Jan Hu, University of Michigan, United States
James Cavalcoli, University of Michigan, United States

Abstract:
Inherited dental defects are a collection of disorders affecting tooth number, size and structure. The frequency of such disorders varies depending on the type of defect and the specific population studied. Oligodontia or familial tooth agenesis, a dominant trait with variable expressivity and penetrance, has a reported frequency of 2-4% in the general population. Many genes, including MSX1, PAX9, AXIN2, WNT10A and EDA, are responsible, when mutated, for non-syndromic oligodontia. Target gene analysis has been successful in less than 50% of the cases with oligodontia, which strongly suggests the possibility of epigenetic influence as well as the presence of additional candidate genes.
A Caucasian family with autosomal dominant, non-syndromic oligodontia was fully characterized based on family history, dental exam and radiographs. Following exclusion of all candidate genes, genomic DNA from four affected and four unaffected members was subjected to a custom exome sequencing variant annotation pipeline (SNPAAMapper) to identify and prioritize a list of candidate genes. We will present a computational approach to look at variant positions across all individuals to determine the frequency of called variants at those mapped positions and a scoring scheme to differentiate between reference and variant alleles in proportion to sequence depth. This will be part of the overall pipeline for variant functional determination and prioritization.
The application of an integrated algorithm combining exome sequencing, protein function/structure prediction and in vitro validation experiments may facilitate an accelerated discovery of a novel genetic etiology responsible for the oligodontia trait observed in the study family.


A29:
Oligonucleotide Microarray Analysis of leptin knockdown in developing zebrafish embryo (D. rerio)

Subject: Disease Models & Molecular Medicine

Presenting Author: Mark Dalman, University of Akron

Author(s):
Anthony Deeter, University of Akron, United States
Zhong Hui Duan, University of Akron, United States
Richard Londraville, University of Akron, United States

Abstract:
Leptin is a 16 kD, adipocyte-derived hormone originally identified in mammals almost 20 years ago. Since its discovery, ~40000 articles have been published on its pleiotropic effects, with less than 10% of these focused on non-mammals. Furthermore, non-mammalian leptin-null mutants have been lacking and transcriptomic analysis has been almost nonexistent. We have recently characterized the knockdown of leptin-A and its receptor in early zebrafish (D. rerio) development; however, very little is known of how leptin-A knockdown influences the transcriptome. Our study is the first to describe the transcriptomic response to leptin knockdown in the developing zebrafish embryo using PARTEK software and BLAST2GO. Our preliminary analysis indicates that 2234 genes were differentially expressed, with ribosome biogenesis the most significantly affected functional pathway. Phototransduction was surprisingly downregulated along with calcium signaling pathways, indicating that sensory function and membrane depolarization may be drastically modulated. P53, MAPK, JAK-STAT, and TGF-beta signaling were all upregulated, suggesting that the developing embryo is actively increasing cell growth, development, and vascular growth, as indicated by increased VEGF expression, despite being in a state of perceived starvation. Interestingly, leptin receptor expression was upregulated in the adipocytokine signaling pathway, implying that the organism is actively recruiting more leptin receptors to the surface to increase sensitivity.


A30:
Determining Low-Dimensional Embeddings in High-Dimensional Genotype Space for Tumor Phylogeny Reconstruction

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Theodore Roman, Carnegie Mellon University

Author(s):
Brittany Fasy, Carnegie Mellon University, United States
Amir Nayyeri, Carnegie Mellon University, United States
Gary Miller, Carnegie Mellon University, United States
Russell Schwartz, Carnegie Mellon University, United States

Abstract:
Cancer formation is an evolutionary system, driven by rapid diversification and selection of individual cells in the tumor cell population. As these tumor cells evolve from healthy cells within the body, they leave behind remnant ancestral cell populations that one can use to reconstruct the process of evolution of the individual tumor. Better understanding of the evolution and composition of tumors can in turn provide insights helpful for developing diagnostic and prognostic tools. We have employed a mixture modeling approach to tumor ancestry reconstruction, using algorithms based on computational geometry to reconstruct major tumor cell populations and possible evolutionary pathways from genomic profiles of individual tumors. Here, we describe progress on new methods for inferring the geometric structure of point clouds defined by tumor data sets in a high-dimensional space of gene expression or DNA copy numbers, with the goal of reconstructing specific geometric structures expected to be indicative of particular evolutionary pathways. We specifically focus here on the problem of identifying low-dimensional simplicial subspaces within a higher-dimensional simplicial complex structure implied by models of tumor evolution. We describe methods for robustly determining the dimensions of subsets of tumor data, evaluated on a selection of simplex and simplicial complex data. We then consider future work to integrate these methods into the complete process of reconstructing the full geometry and interpreting it in terms of evolutionary models.


A31:
HGTector: a new approach to prediction of horizontal gene transfer based on Blast hit distribution statistics

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Qiyun Zhu, University at Buffalo, State University of New York

Author(s):
Katharina Dittmar, University at Buffalo, State University of New York, United States

Abstract:
Horizontal gene transfer (HGT) is prevalent in the microbial world and plays an important role in the evolution of microbial organisms. Multiple computational methods of HGT prediction have been developed based on sequence composition, explicit phylogenetic analysis, or best Blast match. Methods of different categories often give very different results, reflecting the significant limitations in the current methodology for HGT prediction.
We present a new method for rapid, exhaustive and genome-wide detection of HGT, featuring statistical analysis of Blast hit distribution patterns. Unlike previous Blast-based methods that typically focus on one best match for each gene, our method takes into account all hits of all genes in one or multiple genomes of interest. Each gene's Blast hit distribution is compared with the overall distribution pattern of the genomes. Genes that fall beyond a series of statistically determined thresholds are more likely to be horizontally acquired than to be part of the core genome.
This method was tested on seven Rickettsia genomes. Its performance was benchmarked against six other popular methods from all three previously employed categories of HGT prediction. The results of our method show an overall high degree of overlap with the results of all these methods, compared with the degrees of overlap among those methods themselves. This suggests that our method is effective and unbiased.
We developed a computational pipeline, HGTector, to automate this method. We suggest that this program may be a useful tool for researchers to conduct rapid and comprehensive surveys of HGT patterns in multiple newly sequenced genomes.


A32:
Characterizing the SOS response of the human microbiome

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Ivan Erill, University of Maryland Baltimore County

Author(s):
Patrick O'Neill, University of Maryland Baltimore County, United States
Joseph Cornish, University of Maryland Baltimore County, United States
Neus Sanchez-Alberola, University of Maryland Baltimore County, United States

Abstract:
Many bacteria respond to DNA damage by activating a regulatory network commonly known as the SOS response. In Escherichia coli, the SOS response has been shown to coordinate the expression of more than 40 genes involved in DNA repair, translesion synthesis and cell division inhibition. More recently, the SOS response has been linked to control of bacteriophage activity, dissemination of antibiotic resistance genes, integron recombination and toxin production. Recent efforts to map the human microbiome have generated an enormous wealth of data that can be analyzed to gain insight into the complex interactions among the genetic networks of gut bacteria. Here we use known data on the SOS response of Gram-positive and Gram-negative bacteria and conventional search methods to analyze the composition of the SOS response in the microbiome of healthy individuals. Our findings indicate that the SOS response is a key component of the shared microbiome gene network, linking multiple gene dissemination pathways. The gene function profile of the microbiome SOS response is remarkably similar to that observed in reference organisms, suggesting that it is conserved among gut bacteria. However, we also report an abundance of regulated toxin-antitoxin genes, highlighting the importance of plasmid-borne contributions to this system and the ability of metagenomic data to provide wide sample diversity and new information on regulatory networks. We validate some of these findings experimentally and discuss their implications with regard to the analysis of metagenomic data and the nature of the SOS response.


A33:
Investigation of Filtering Metagenomic Sequencing Data on Assembly

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Tobias Ortega-Knight, University of the Virgin Islands

Author(s):
Alexis Black Pyrkosz, USDA, United States
Adina Chuang Howe, Michigan State University, United States
C. Titus Brown, Michigan State University, United States

Abstract:
Digital normalization is a new heuristic approach that reduces sequencing data to a minimum while preserving maximum information. We hypothesize that filtering metagenomic data by digital normalization reduces the data volume without losing information. We mapped reads to reference genomes from the Human Microbiome Project (HMP) mock data set. We then performed metagenomic assembly on raw, normalized, and normalized partitioned reads. Next, we mapped the raw reads to the de novo assemblies and compared them to the reference to determine if information from the normalized reads was lost and/or if there was an advantage in the normalized assemblies. The mapping analysis shows the digitally normalized assembly had a 1.33% increase in reads mapping compared to the raw read assembly and retained information that would have been lost. The normalized and partitioned assembly lost only 2.49% of reads that mapped when compared to the nonnormalized assembly results. The nonnormalized assembly took 57 minutes to complete and used approximately 16.1 gigabytes of memory. The digitally normalized assembly took 8 minutes to run, 7.125x faster than the nonnormalized assembly, and used 34 megabytes of memory, a roughly 485-fold reduction in memory use. The digitally normalized and partitioned assembly finished in 11 minutes (5.18x faster) and used 48 megabytes of memory, a roughly 344-fold decrease in memory compared to the nonnormalized assembly. This investigation shows that digital normalization can reduce redundant reads and their associated errors in metagenomic sequencing data, thereby significantly decreasing the time and computational resources required to perform de novo assembly without sacrificing sensitivity.
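
The core idea of digital normalization, as it is generally described, can be sketched in a few lines: a read is kept only if the median abundance of its k-mers among the reads kept so far is still below a coverage cutoff. The sketch below is an illustration under stated assumptions, not the tool used in this study: it uses an exact in-memory counter rather than a memory-efficient probabilistic one, and the function name, `k`, and `cutoff` values are chosen only for the toy example.

```python
from collections import Counter
from statistics import median


def digital_normalize(reads, k=4, cutoff=2):
    """Keep a read only if the median count of its k-mers (among reads
    already kept) is below `cutoff`; then count its k-mers."""
    kmer_counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read is shorter than k
        # Median k-mer abundance serves as an estimate of the read's coverage.
        if median(kmer_counts[km] for km in kmers) < cutoff:
            kept.append(read)
            kmer_counts.update(kmers)
    return kept


if __name__ == "__main__":
    # Toy input: one highly redundant read plus one unique read.
    reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
    print(digital_normalize(reads))
```

In the toy input, redundant copies of the first read are discarded once their estimated coverage reaches the cutoff, while the unique read is retained.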


A34:
Identification of genes involved in the stage of early embryonic development in mouse using microarray expression data

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Seung Gu Park, Department of Medical Biotechnology, College of Biomedical Science, and Institute of Bioscience & Biotechnology, Kangwon National University, Chuncheon 200-701, South Korea.

Abstract:
The maternal-to-zygotic transition (MZT), which occurs following fertilization, is the most important developmental transition in early embryonic development. It requires a dramatic reprogramming of gene expression, i.e., degradation of maternal RNAs and activation of the zygotic genome, which is essential for continued development. We compared gene expression patterns of 6 different embryonic stages, including zygote, cleavage, blastula, gastrula, somitogenesis and organogenesis, with those from 54 different adult tissues. For this, we used GNF1M Gene Atlas for 22989 downloaded from GEO (http://www.ncbi.nlm.nih.gov/geo/) and GXD expression information for 9182 genes downloaded from MGI (http://www.informatics.jax.org). We then estimated what proportion of genes expressed in the 54 adult tissue types are also expressed in the 6 different embryonic developmental stages. We found that 99 of the 191 putative zygote-expressed genes (51.8%) are also expressed in the oocyte, meaning that over half of the zygote-expressed genes are oocyte-expressed genes. The proportion of oocyte-expressed genes in the zygote decreased dramatically after the cleavage stage. These patterns were not found in other proportion pairs between adult tissue and embryonic tissue, which is consistent with the idea of a maternal-to-zygotic transition.


A35:
A Bioinformatics Analysis of the Vertebrate Twist Protein Family

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Yacidzohara Rodriguez, University of Puerto Rico Medical Sciences Campus

Abstract:
The Twist protein family is expressed in a variety of different tissues during early stages of embryogenesis, and its presence is essential for proper development and survival. Though a great deal has been elucidated concerning the highly conserved bHLH and carboxy domains of Twist proteins, more focus is needed on the N-terminal region and on the evolution of such proteins, particularly in the mammal group of vertebrate animals. By sequence comparison we analyzed the conservation of the N-terminal region of a variety of different vertebrate Twist proteins. We identified two putative de novo motifs (SSSPVSP and SEEE) at the N-terminal of such sequences, specifically in the mammal group. In addition, a number of mutations present in the NLSs of a couple of species demonstrate that there could be other possible residues influencing the nuclear localization of such proteins, with the G residue of the second NLS being a potential target. By phylogenetic analysis, we investigated the origin and evolution of the two paralog Twist proteins. First, a gene duplication event occurred to produce the two Twist paralogs. Second, among the two Twist paralog branches, Twist1 is the most distant in vertebrates; perhaps it evolved first owing to an abrupt increase of substitutions in a short period of time caused by functional divergence. Also, in mammals, Twist1 seems to have a higher rate of evolution than Twist2, since only Twist1 sequences contain two glycine-rich motifs not present in Twist2. Our findings shed light on the relationship of Twist1 and Twist2 paralogs in the mammal group of vertebrate animals.


A36:
Cytochrome P450 genes of the Capitella teleta (Polychaeta) genome

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Christopher Dejong, McMaster University

Author(s):
Joanna Wilson, McMaster University, Canada

Abstract:
Proteins of the cytochrome P450 (CYP) family are monooxygenase enzymes responsible for a wide array of functions, including the metabolism of xenobiotics and the production of steroids. The annelid Capitella teleta genome was sequenced in 2007 and is in version 1 assembly. C. teleta is an increasingly important polychaete bio-indicator species in marine environments since it is able to thrive in polluted marine habitats such as those contaminated by oil spills. Yet there has been limited work on the metabolism of xenobiotics in this species. I have annotated the CYPs in this species, including various pseudogenes. Annotation was completed through homology searches and by utilizing the available EST database. Phylogenetic analyses of the newly identified sequences show that many C. teleta CYPs cluster with known xenobiotic-metabolizing CYPs from other species. A total of 73 CYP genes were identified, and representatives from all the animal CYP clans were found. The presence of these CYPs may contribute to the species' resilience in contaminated habitats. Several steroidogenic candidate genes emerged from these phylogenetic analyses as genes clustering with vertebrate steroidogenic CYP gene families. Future work will involve in silico functional analysis to help narrow down the possible function of several of these genes through domain homology, protein folding, and subsequent substrate binding studies. These data, beyond their value to evolutionary biology, are important in understanding how C. teleta is able to withstand contaminant exposure and will also help piece together the steroid hormone pathway in annelids.


A37:
Examining Copy Number Variation in Mouse piRNA clusters

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Raghunandan Avula, Carnegie Mellon University

Author(s):
Dr. Andy Perkins, Mississippi State University, United States
Dr. Federico Hoffmann, Mississippi State University, United States

Abstract:
Piwi-Interacting RNAs (piRNAs) are small noncoding RNAs that have recently sparked interest because they form complexes with Piwi proteins that play a role in transposon silencing. As major components of vertebrate genomes, transposable elements have the ability to make copies of themselves, mobilize, and insert themselves in different parts of the genome. This can have dramatic effects on genome evolution. Through sequence complementarity, piRNAs target the gene silencing activity of PIWI proteins to transcripts derived from transposable elements. Further examination of these clusters has revealed the existence of stable and dynamic clusters, which differ by the number of times each cluster is found within the genome. Clusters can also be characterized by the timing of their acquisition during mammalian evolution as ancestral, before rat-primate divergence, or after rat-mouse divergence. Our central goal is to understand whether the local rates of copy number variation (CNV) impact a cluster’s characterization as stable or dynamic and its acquisition during evolution. By collecting and analyzing data on mouse piRNA clusters and copy number variations at various genomic locations, we used computational techniques to identify patterns between CNVs and piRNAs. Recently acquired clusters were found to have fewer CNVs than ancestral clusters, suggesting that selection could play a role in their maintenance and supporting the idea that these clusters are important to an organism. Stable and dynamic clusters have similar numbers of CNVs; however, stable clusters are in more variable locations of the genome and have a rate of CNVs similar to that of their surroundings.


A38:
QC Checker and SNP Genotype Editor: Two Application Tools for GWAS

Subject: other

Presenting Author: Carl McIntosh, National Cancer Institute

Author(s):
Xinlian Liu, Hood College, United States
Jennifer Troyer, Frederick National Laboratory for Cancer Research - SAIC, United States

Abstract:
Genome-wide association studies (GWAS) are well into their second generation, with several platforms available to identify genotype variation at hundreds of thousands of loci; even so, much debate remains as to the power of these studies to detect true positives. The need for properly powered studies is clearly central to these debates, but a secondary issue that is no less important is the need for pristine data sets. Two sources of data error can contribute to the risk of type II statistical errors: the inclusion of individuals with bad SNP calls and the inclusion of SNPs with bad calls for all individuals.


A39:
Churchill: A Cloud-Enabled, Ultra-Fast Computational Approach for the Discovery of Human Genetic Variation

Subject: Personalized Genetics & Genomics

Presenting Author: Ben Kelly, The Research Institute at Nationwide Children's Hospital

Author(s):
James Fitch, The Research Institute at Nationwide Children's Hospital, United States
Don Corsmeier, The Research Institute at Nationwide Children's Hospital, United States
Peter White, The Research Institute at Nationwide Children's Hospital, United States

Abstract:
Next generation sequencing (NGS) has revolutionized genetic research, empowering dramatic increases in the discovery of new functional variants. Compounded by declining sequencing costs, this exponential growth in data generation has created a computational bottleneck. Churchill is a computational approach that overcomes these challenges, fully automating the analytical process required to take raw sequencing instrument output through the complex and computationally intensive process of alignment and variant calling. Through implementation of novel parallelization techniques we have dramatically reduced the analysis time for whole human genome resequencing from weeks to hours, without the need for specialized analysis equipment or supercomputers. Compared with alternative analysis pipelines, Churchill is simpler, faster, deterministic and capable of running on all popular Linux environments. Furthermore, Churchill optimizes utilization of available compute resources and scales in a near-linear fashion, enabling complete human genome resequencing analysis in ten hours with a single server, three hours with our in-house cluster and under two hours using a larger HPC cluster. Churchill is cloud-compatible, and we demonstrate the expansive degree of parallelization Churchill can achieve using Amazon’s Elastic Compute Cloud (EC2) instances. Not only does this allow investigators to potentially reduce analysis time by leveraging the cloud's ability to easily scale, but it also enables investigators to perform low-cost resequencing analysis without needing to invest in the required infrastructure to build their own high-performance computing cluster. Churchill eliminates the NGS bioinformatics overhead and is a prime candidate to overcome the bottleneck that even faster sequencing will create.


A40:
Transcriptomic analysis of the host response to respiratory syncytial virus in murine lung macrophages

Subject: Personalized Genetics & Genomics

Presenting Author: Hui Chen, Nanyang Technological University

Author(s):
Richard Sutejo, Nanyang Technological University, Singapore
Richard Sugrue, Nanyang Technological University, Singapore

Abstract:
Background: Respiratory syncytial virus (RSV) is a major cause of lower respiratory tract infections in infancy and childhood. Lung macrophages play essential roles in both innate and adaptive immune responses; thus, a comprehensive gene expression profile of RSV-infected macrophages was examined using a microarray system.
Results: Microscopy and immunoblotting showed that lung macrophages could be efficiently infected with RSV in vitro, with the expression of several virus structural proteins detected. Analysis of microarray data indicated that a total of 1791 and 2262 genes showed differential expression after RSV infection at 4 hpi and 24 hpi, respectively. Around two thirds of these genes were up-regulated, with high fold changes. Detailed investigation demonstrated that the genes with elevated expression were functionally associated with cytokines, antiviral activity, cell death, and signal transduction. Furthermore, pathway analysis suggested that pathways related to pathogen recognition, interferon signalling and antigen presentation were enriched in the genes with increased expression. Genes with down-regulated expression were involved in functional groups such as signal transduction, RNA binding and protein kinases.
Conclusion: This study has provided deep insights into the global host response in RSV-infected lung macrophages. Our data have revealed the activation of several antiviral signalling pathways, such as interferon type I signalling and cell death signalling. Moreover, sustained induction of pro-inflammatory cytokine expression during RSV infection has also been identified, even in the absence of a productive infection. Thi


A41:
Applying digital normalization to transcriptome sequencing: effects of varying coverage

Subject: Personalized Genetics & Genomics

Presenting Author: Danny Lynch, University of the Virgin Islands

Author(s):
Adina Chuang Howe, Michigan State University, United States
Alexis B. Pyrkosz, USDA Avian Disease and Oncology Laboratory, United States
C. Titus Brown, Michigan State University, United States

Abstract:
Advances in next generation sequencing (NGS) have led to a wealth of data being produced by the scientific community. These large data sets require new software pipelines to lower the cost of transcriptome assembly. Digital normalization, a single-pass algorithm that reduces the size of shotgun sequence data sets, is one such tool. The purpose of this study was to determine if digital normalization could be effectively used to reduce RNA-Seq data sets while retaining sufficient information for accurate de novo assembly. We hypothesized that digital normalization at varying coverage values would produce assemblies similar to those obtained from processing non-normalized data. Using a yeast reference transcriptome and a known RNA-Seq read set, we digitally normalized the raw reads at varying coverage levels. We then assembled the original and normalized read sets and compared the resulting transcriptome assemblies with the reference. We found that, at varying coverage values, the de novo transcriptome assemblies built from digitally normalized yeast reads contained fewer errors than the assembly built from the non-normalized data. Furthermore, when the raw reads were aligned against each of the assemblies, the digitally normalized assemblies returned a 20% more accurate mapping than the non-normalized assemblies. By more efficiently processing RNA-Seq data using digital normalization, it will be possible to assemble complete transcriptomes in a fraction of the current time. This increased knowledge will facilitate greater understanding of gene expression and function. Eventually, this could help improve direct patient care and future biomedical research endeavors.
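The single-pass filtering idea behind digital normalization can be sketched as follows. This is only an illustration and not the software used in the study: it assumes the commonly described rule of keeping a read only when the median abundance of its k-mers, counted over the reads kept so far, is below a coverage cutoff, and it uses an exact dictionary where production tools use a memory-efficient probabilistic counter. The function name and the parameters k and cutoff are hypothetical.

```python
from collections import Counter
from statistics import median


def digital_normalize(reads, k=4, cutoff=2):
    """Single pass over reads: keep a read only if its estimated coverage is below cutoff."""
    counts = Counter()  # k-mer abundances over the reads kept so far
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k: no coverage estimate is possible
        # The median k-mer abundance serves as the coverage estimate for this read.
        if median(counts[kmer] for kmer in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)
    return kept


# Toy input: five copies of one read plus one unrelated read.
reads = ["ACGTTGCAAG"] * 5 + ["TTGACCAGTA"]
print(digital_normalize(reads, k=4, cutoff=2))
```

With a cutoff of 2, redundant copies of the abundant read are dropped once its k-mers have been seen twice, while the unrelated read is retained; this is how the data set shrinks without losing low-coverage sequence.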


A42:
The role of RNA dynamics in mRNA polyadenylation supported by classical genetics analysis

Subject: RNA & Protein

Presenting Author: Min Dong, Miami University

Author(s):
Guoli Ji, Xiamen University, China
Q. Quinn Li, Miami University, United States
Chun Liang, Miami University, United States

Abstract:
Using a bioinformatics approach that detects RNA structure motifs and RNA dynamics, we have re-examined the wet-lab results of in-vivo mutagenesis experiments on polyadenylation efficiency in many eukaryotic species. We found that hairpin-like structures and their junction connectors are common in the 3’-end of mRNAs and form a specific combinatory RNA conformation with dynamics that are preferred by the polyadenylation protein complex. Within such a combinatory conformation, the cleavage site region (~ -/+ 15 nt, cleavage site at the 0 position) appears to have a clearly higher free energy level than its flanking regions, forming a kinetic trap that facilitates protein-RNA binding in polyadenylation. The kinetic barrier, a short sequence region (~5-8 nt) immediately upstream of the kinetic trap, controls its downstream RNA sequence (from ~ -15 to +50) to form a bistable region where alternative, coexisting and convertible RNA structures exist, and one of the alternatives is preferred by RNA-binding proteins. Interestingly, the canonical poly(A) signal AAUAAA and its single- or di-nucleotide variants often act as the kinetic barrier. Mutagenesis results of polyadenylation efficiency experiments (i.e., in-vivo mutation of AAUAAA or AAUAAA-like motifs) proved the existence and importance of such a kinetic trap and barrier in polyadenylation. Our study suggests that mRNA polyadenylation in eukaryotes involves a complex interplay among RNA conformations and dynamics, rather than linear motif recognition.


A43:
Undergraduate Student

Subject: RNA & Protein

Presenting Author: Anthony Bortolazzo, San Jose State University

Author(s):
Natalia Khuri, University of California San Francisco, United States
Sami Khuri, San Jose State University, United States

Abstract:
According to recent estimates, up to half of all disease-causing mutations disrupt splicing. These mutations can give rise to the activation of "cryptic splice sites" (CSS), regions of the pre-mRNA that are not otherwise spliced, producing an incorrect mRNA transcript. The activation of CSS has been implicated in cystic fibrosis, hereditary breast and ovarian cancers, as well as hemoglobinopathic diseases, such as β-thalassemia. β-thalassemia is an inherited blood disorder characterized by a lack or inadequate expression of β-globin, resulting in less efficient transport of oxygen in the blood. β-thalassemia currently affects a large percentage of people in Africa, Mediterranean countries and southern Asia, and increasingly in the rest of the world due to migration. Fifteen out of over 200 β-thalassemia mutations give rise to cryptic splice site activation, thus making this gene an interesting model for studying mutations.
In this work, we developed a predictor of putative CSS in the human genome. We trained the model on 466 splice sites originating from 170 different genes. We validated the model by 10-fold cross-validation and tested it by predicting splice sites in the human β-globin gene. The performance of the model was evaluated by computing the area under the receiver operating characteristic curve (auROC). The value of the auROC was 0.98. The model correctly predicted 90.9% of known CSS in the human β-globin gene. We compared our model with existing cryptic splice site predictors. Our model can be used to streamline in vitro studies of cryptic splice sites.
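For readers unfamiliar with the auROC metric used above, the following sketch computes it from its pairwise definition: the fraction of (true site, non-site) pairs in which the true site receives the higher score, with ties counted as one half. The scores and labels are made-up toy values, and the function name is hypothetical; nothing here reproduces the predictor or data from the abstract.

```python
def au_roc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores for candidate sites (1 = true splice site, 0 = decoy).
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0]
print(au_roc(scores, labels))
```

A value of 1.0 means every true site outranks every decoy, and 0.5 corresponds to random ranking. The quadratic loop is chosen for clarity; a rank-based formulation gives the same value more efficiently on large data sets.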


A44:
Applying Cellular Crowding Models to Simulations of Virus Capsid Assembly in vitro

Subject: RNA & Protein

Presenting Author: Gregory Smith, Carnegie Mellon University

Author(s):
Lu Xie, Carnegie Mellon University, United States
Byoungkoo Lee, Georgia State University, United States
Russell Schwartz, Carnegie Mellon University, United States

Abstract:
A major issue in modern biological research is the ability to accurately describe in vivo cellular processes based on in vitro experimental data. Virus capsid assembly presents a great example of this complication. Although it is an important model system for complex self-assembly processes, sources of experimental evidence for capsid assembly are very limited. We have previously applied numerical optimization methods to fit kinetic rate parameters of assembly simulations to light scattering data in order to more accurately model the in vitro viral capsid assembly process, making it possible to predict detailed patterns of interaction and pathways of assembly of specific viruses. There is significant evidence to suggest that in vitro interaction patterns may be altered in the very different in vivo environment. We seek to examine one aspect of this difference: the effects of intracellular molecular crowding. We have applied regression models developed from Green’s function reaction dynamics simulations to adjust inferred reaction kinetics of capsid models to reflect likely differences in crowded systems. We then examined how such adjustments affect computer models of the capsid assembly process. We applied these methods to three icosahedral viruses: human papillomavirus, hepatitis B virus, and cowpea chlorotic mottle virus. We discovered complicated effects on pathway dynamics and found a correlation between the presence or lack of nucleation-limited growth and a distinction in the effects of increased cellular crowding. We view this as a first step in exploring how additional features of the cellular environment can change our perception of the capsid assembly process.


A45:
Using a sequence-to-structure-to-function approach to understand disease significance: Evolution of the testis determining gene Sry and its protein family, Sox.

Subject: RNA & Protein

Presenting Author: Jeremy Prokop, The University of Akron

Author(s):
Monte Turner, The University of Akron, United States
Amy Milsted, The University of Akron, United States

Abstract:
In the age of sequencing, knowledge is no longer limited by the generation of sequence data but by the ability to analyze that sequence and determine its functionality. The regions whose function is easiest to annotate are protein-coding genes. We show the impact of using protein structure to map and analyze large sets of sequence data to determine previously undetermined regions of functionality. Using the dbSNP dataset, personal genomes, the 1000 Genomes data, natural variants associated with disease phenotype, multiple mammalian sequence alignments, and synonymous/non-synonymous mutation rates, combined with decades of molecular biology/biochemistry results, we can begin to understand regions of the protein that are under selection to maintain proper function. This allows us to map DNA and protein binding potentials to these regions, allowing for a detailed view of functionality. Here, we present the analysis for the testis determining protein, Sry. We have mapped onto the structure of Sry the interactions with both DNA and androgen receptor. The significance of these sites was validated with GST pull-down experiments and site-directed mutagenesis. By ascribing these interactions to amino acids of a protein structure, we can study the evolution of the SOX family of proteins, to which Sry belongs. Our results suggest a highly conserved interaction with androgen receptor and also Oct4. Using the protein structure allows for determination of conserved pockets in the proteins that cannot be determined using sequence only. We suggest that this approach can be applied to all known proteins, to allow novel hypothesis generation of protein functionality.


A46:
Domain content-based clustering of proteins

Subject: RNA & Protein

Presenting Author: Neethu Shah, University of Nebraska, Lincoln

Author(s):
Stephen D. Scott, University of Nebraska, Lincoln, United States
Etsuko N. Moriyama, University of Nebraska, Lincoln, United States

Abstract:
Protein clustering and classification are important for predicting protein functions and reconstructing their evolutionary relationships. Protein domains are the fundamental units of protein structure, function, and evolution. Therefore, proteins and their functions can be defined and classified in terms of their domain content. In this study, we propose a methodology of protein clustering based on domain composition. We use the principle of co-clustering. Two variables, "proteins" and "domains", are simultaneously clustered into sub-groups, where each submatrix representing a protein subgroup meets a certain similarity criterion (e.g., a significant E-value threshold). We first apply Bimax, a biclustering algorithm that employs a divide-and-conquer strategy on a simple binary model to find all the maximal bicliques. A single biclique represents sets of proteins and domains that form a complete bipartite graph, in which every vertex of the protein set is connected to all the vertices of the domain set. Each bicluster, thus, reveals a domain co-occurrence pattern. Further analysis is performed to divide each bicluster into strongly connected ortholog groups and weakly connected groups of paralogs based on both the architecture and the similarity of domains in each protein in the cluster. This protocol is applied to more than 10 proteome data sets from both prokaryotic and eukaryotic genomes. Clustering patterns obtained by our method are compared against the results obtained by using phylogenetic profile-based methods such as GDDA-BLAST and PHYRN, and BLAST similarity-based methods such as Inparanoid and OrthoMCL.
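The biclique condition described above can be made concrete with a short sketch. This is not Bimax and not the authors' pipeline; it only checks whether a chosen set of proteins and a chosen set of domains form a complete bipartite subgraph, and finds the largest domain set that does so for given proteins. The protein identifiers, domain names, and function names are hypothetical toy values.

```python
def is_biclique(domains_of, proteins, domains):
    """True if every protein in `proteins` contains every domain in `domains`."""
    return all(set(domains) <= domains_of[p] for p in proteins)


def shared_domains(domains_of, proteins):
    """Largest domain set forming a biclique with the given (non-empty) proteins."""
    return set.intersection(*(domains_of[p] for p in proteins))


# Toy binary protein-by-domain relation stored as sets.
domains_of = {
    "P1": {"SH2", "SH3", "kinase"},
    "P2": {"SH2", "kinase"},
    "P3": {"SH3"},
}

print(is_biclique(domains_of, ["P1", "P2"], {"SH2", "kinase"}))
print(shared_domains(domains_of, ["P1", "P2"]))
```

In this toy relation, P1 and P2 together with the domains SH2 and kinase form a biclique, which is the kind of domain co-occurrence pattern a bicluster reveals.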


A47:
Analyzing the computational demands of the Trinity RNA-Seq assembler.

Subject: RNA & Protein

Presenting Author: Jonathan Strickland, Pittsburgh Academy of Science and Technology

Author(s):
Alexander Ropelewski, Carnegie Mellon University, United States

Abstract:
With the affordability of next generation sequencers and the data they produce, RNA sequencing is becoming increasingly available to researchers. To assemble RNA sequence data, de novo de Bruijn graph-based assemblers, such as Trinity, are becoming preferred over assemblers that map reads to a reference sequence. Unfortunately, de Bruijn assemblers come with high computational demands, both in terms of memory and time. In this poster, we quantify the computational demands of Trinity through the use of several different paired and unpaired RNA sequence data sets from NCBI’s Sequence Read Archive. We further illustrate with these datasets how computationally demanding the specific stages of the Trinity process are and how well each stage scales on our test platforms. Our research illustrates that, on the test datasets, the Chrysalis stage (cluster contigs, build de Bruijn graph) consumed about 43% of the total runtime, followed by Butterfly (build sequence) at 30%, and Inchworm (assemble contigs) at 20%. The remaining time was divided among other Trinity processes. The standard deviations for the Inchworm, Chrysalis, and Butterfly runtimes were all approximately 10%.


A48:
Comparative Analysis of Class I Methyltransferases

Subject: RNA & Protein

Presenting Author: Dorothy McAfee, Franciscan University of Steubenville

Abstract:
S-adenosyl methionine (AdoMet)-dependent methyltransferases catalyze the transfer of the methyl group from AdoMet to various substrates. The targets of methylation include carbons, oxygens, nitrogens, and sulfurs. The goal of this undergraduate research project was to compare the structural, functional, and phylogenetic relationships of methyltransferases by aligning 256 amino acid sequences of seven different types of methyltransferases across various organisms. Overall, these seven groups of methyltransferases shared little sequence identity (less than 10% for some). No residues were invariantly conserved and only two residues, both glycines found in the AdoMet-binding Rossmann fold, were above 80% conservation. Another well-conserved position has an acidic residue which hydrogen bonds to the AdoMet ribose hydroxyls. Many residue positions had conserved hydrophobic similarities. Despite low sequence conservation, the structural domain containing the Rossmann fold that binds AdoMet is conserved in all class I methyltransferase types, while domains involved in substrate binding were unique to each type. Pattern analysis identified the twenty most well-conserved sequence motifs in the methyltransferases. Only motif 8 was found in all seven types; the rest were type-specific. Phylogenetic analysis revealed distinct clades for each type of methyltransferase and their relation to each other. Group entropy analysis also demonstrated residues unique to each type of methyltransferase.


A49:
Comparative Analysis of Pentapeptide Repeat Proteins

Subject: RNA & Protein

Presenting Author: Nicholas Cundiff, Franciscan University of Steubenville

Abstract:
Pentapeptide repeat proteins share a common structure composed of a repeated five-amino-acid sequence. These proteins are found in most living organisms, ranging from bacteria to plants and animals. Of particular interest is the MfpA protein in Mycobacterium tuberculosis, which helps to confer resistance to fluoroquinolone antibiotics, such as ciprofloxacin. This is thought to be due to the striking resemblance of the protein’s tertiary structure to a DNA double helix. MfpA binds to DNA gyrase, thereby preventing the fluoroquinolone from binding to the DNA and eliciting its effects. This undergraduate research project sought to compare 173 amino acid sequences from a range of organisms and determine their structural, functional, and phylogenetic relationships. Besides the inherent repeated pentapeptide sequence, very little specific residue conservation is observed. The main location of conservation was a series of leucine residues at position “i” on each face of the molecule, suggesting the importance of hydrophobic packing in the core of the protein. Despite little sequence conservation, the overall tertiary structure is highly conserved, with four faces on the molecule surface, and closely resembles a DNA molecule. Pattern analysis revealed that the most well-conserved sequence motif comprises the pentapeptide repeat and is found throughout all aligned sequences, whereas the other nine motifs were found in a taxon-specific manner. Phylogenetic analysis demonstrated distinct clades based on taxonomic differences, with plant and cyanobacterial sequences being closely related.


A50:
CAFA-BC and CAFA-AD: benchmarking and assessment tools for protein function prediction

Subject: Databases & Ontologies

Presenting Author: Rajeswari Swaminathan, Miami University

Abstract:
Standardized benchmarking is needed for the assessment of protein function prediction methods. Here we present CAFA-BC and CAFA-AD, software tools to create benchmarks and to assess function prediction method performance. The software was initially created due to a need identified at the Critical Assessment of protein Function Annotation challenge (CAFA). However, it is intended for general use, and is not limited to CAFA applications.
The benchmark creator CAFA-BC takes as input two UniProt-GOA files from two historical time points, t1 and t2. A benchmark set is created comprising the proteins that were electronically annotated in t1, but became experimentally validated in t2. Given a reasonable time gap between t1 and t2 (over 6 months), a significant number of proteins would receive experimental validation. This benchmark set can be customized according to the needs of a user, by species, ontologies, GO evidence codes, and other criteria. The assessment software, CAFA-AD, assesses a method's predictions against the CAFA-BC benchmark. The protein-centric assessment uses precision/recall to assess how well a method predicts any given protein's function. The term-centric measure uses a ROC curve to assess how well any given ontology term is predicted.
CAFA-BC and CAFA-AD are provided under an open-source license and are free to use, modify and redistribute.
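The benchmark selection rule described above can be sketched as a set operation. This is an illustration under stated assumptions, not the CAFA-BC code: annotations are given as simple (protein, GO term, evidence code) tuples rather than parsed UniProt-GOA files, "electronically annotated" is taken to mean evidence code IEA, "experimentally validated" is taken to mean one of the GO experimental evidence codes, and proteins that already had experimental evidence at t1 are excluded. The function name and the example identifiers are hypothetical.

```python
# GO evidence codes: experimental codes versus the electronic code.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
ELECTRONIC = {"IEA"}


def benchmark_proteins(annotations_t1, annotations_t2):
    """Proteins with only electronic support at t1 that gained experimental support by t2."""
    annotations_t1 = list(annotations_t1)  # allow generators; t1 is scanned twice

    def proteins_with(annotations, codes):
        return {protein for protein, _term, code in annotations if code in codes}

    electronic_t1 = proteins_with(annotations_t1, ELECTRONIC)
    experimental_t1 = proteins_with(annotations_t1, EXPERIMENTAL)
    experimental_t2 = proteins_with(annotations_t2, EXPERIMENTAL)
    # Assumption: a protein already validated at t1 cannot "become" validated.
    return (electronic_t1 - experimental_t1) & experimental_t2


t1 = [("P1", "GO:0003677", "IEA"), ("P2", "GO:0005515", "IEA"),
      ("P3", "GO:0016301", "IDA"), ("P3", "GO:0005524", "IEA")]
t2 = t1 + [("P1", "GO:0003677", "IDA"), ("P4", "GO:0005215", "IMP")]
print(benchmark_proteins(t1, t2))
```

In this toy data only P1 qualifies: P2 never gains experimental evidence, P3 was already validated at t1, and P4 had no electronic annotation at t1.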


A51:
Application of Topological Gene Ontology Metric on Proteins with Altered Expressions in Lung Cancer

Subject: Databases & Ontologies

Presenting Author: Andrew Marmaduke, University of Akron

Abstract:
The Gene Ontology is an initiative to represent and annotate genes under a controlled vocabulary and provide tools for easy access to the represented information. The Gene Ontology is a useful database to both represent and compare proteins or other functional processes. Various metrics exist to compare the similarity between proteins. In particular, the topology-based metric of Mazandu et al. was used in this work to compare proteins expressed differently in cancerous cells. The topology-based approach is capable of taking advantage of the DAG structure of the ontology. The distance or relative level of a node with respect to its ancestors is captured in the similarity metric. With this metric, the similarity between proteins expressed differently in cancerous cells is compared. The similarity between two panels of proteins expressed differently in varying subtypes of lung cancer and a control is explored and reported.


A52:
Biological Database of Images and Genomes: tools for community annotations linking image and genomic information

Subject: Databases & Ontologies

Presenting Author: Andrew Oberlin, Miami University

Author(s):
Dominika Jurkovic, Miami University, United States
Mitchell Balish, Miami University, United States
Iddo Friedberg, Miami University, United States

Abstract:
Genomic data and biomedical imaging data are undergoing exponential growth. However, our understanding of the phenotype-genotype connection linking the two types of data is lagging behind. While there are many types of software that enable the manipulation and analysis of image data and genomic data as separate entities, there is no framework established for linking the two. We present a generic set of software tools, BioDIG, that allows linking of image data to genomic data. BioDIG tools can be applied to a wide range of research problems that require linking images to genomes. BioDIG features the following: rapid construction of web-based workbenches, community-based annotation, user management, and web services. By using BioDIG to create websites, researchers and curators can rapidly annotate a large number of images with genomic information. BioDIG not only provides a structure for annotating images and linking genomic information, but also provides tools for users to collaborate on projects and work together seamlessly. Here we present the BioDIG software tools, which include an image module, a genome module and a user management module. We also introduce a BioDIG-based website, MyDIG, which is being used to annotate images of Mycoplasma. This software is available under an open source license via http://biodig.org.


A53:
The Distribution of Single Nucleotide Polymorphisms in Genomes of Patients with Pyoderma Gangrenosum: Biomarker Discovery

Subject: Databases & Ontologies

Presenting Author: Heather Mercer, Kent State University

Author(s):
Mary Halpin, Kent State University, United States
Judy Fulton, Akron General Medical Center, United States
Eliot Mostow, Northeast Ohio Medical University, United States
Helen Piontkivska, Kent State University, United States

Abstract:
Pyoderma gangrenosum (PG) is a poorly healing, difficult-to-diagnose wound. Thus, genomic biomarkers that can aid in faster and more reliable PG diagnosis are needed. Six PG patients were genotyped (Affymetrix SNP 6.0). Bioinformatics analyses of 64,997 SNPs found in 17,517 genes shared among PG patients were conducted. We hypothesized that, as a multi-factorial disease, PG can be driven by specific genetic variants that interfere with the inflammatory pathways, including neutrophil clearance. Functional gene annotation was conducted to determine whether the gene categories of shared SNP variants are enriched for immune-, inflammation- and/or apoptosis-related variants. Out of 17,517 genes that shared an allelic state between PG patients, the majority (14,654 genes) were shared in a homozygous state (in other words, the same alleles were shared by homologous chromosomes), while about 16% (2,863) were shared in a heterozygous state. Of the latter category, there were about 26 clusters with enrichment scores of 2.0 and above, of which at least 6 (encompassing 590 genes) contained genes annotated as related to immune/inflammation processes. This represents a significant over-enrichment of genes related to immune function/inflammation compared to approximately 5.7% (1083) out of 19,042 human protein-coding genes that are classified as such. To gain insights into potential molecular mechanisms behind PG, we are examining other functional properties of shared SNPs, including those located in non-coding regions. The latter category, while currently lacking known functional annotations, may provide insight into genome function as putative regulatory regions and/or non-coding RNA.


A54:
Pooling Microarray data for Multiple Sclerosis (MS), Rheumatoid Arthritis (RA), Systemic Lupus Erythematosus (SLE) and Sjögren’s Syndrome (pSS)

Subject: Databases & Ontologies

Presenting Author: Eric Haynes, University of Toledo

Abstract:
Our goal for this study is to define the prevalence in the human population of four autoimmune diseases: Multiple Sclerosis, Rheumatoid Arthritis, Systemic Lupus Erythematosus and, more recently, Sjögren’s Syndrome. Each of these diseases is known to be gender-biased, primarily surfacing in middle-aged Caucasian women. The female-to-male ratios were nearly 2:1 (for MS and RA), in contrast to 9:1 (for SLE and pSS). Each year, an increasing number of new cases is reported to health-care professionals. Substantial under-reporting from misdiagnosis, wherein practitioners could not properly identify an individual’s disease, was generally caused by undiagnosed, missing or dormant symptoms. Historically, the inability of health-care professionals to properly align symptoms with established disease criteria has led to unnecessarily higher disease rates involving discomfort, disability and, in some cases, death.
Our study aims to find differentially expressed genes that occur not only in one disease but also in the other autoimmune diseases, leading to the potential development of predictive biomarkers. The need for this study arose from complications in comparing data sets, including differences in commercial and proprietary platforms, varying sample sizes, sample origins (tissue versus blood) and the disproportionality of sample information. While attempts have been made to use formal submissions, data from one study may include participant characteristics such as age, gender and disease progression/stage that others omit. Omission of age and gender data hinders efforts to relate these factors to initial disease progression, potentially limiting the discovery of these genes or useful biomarkers.


A55:
Why MD Equilibrated Protein Structures Are Different Than Experimentally Determined: A Thermodynamic Insight

Subject: Chemical Biology

Presenting Author: Filippo Pullara, University of Pittsburgh

Author(s):
Mert Gur, University of Pittsburgh, United States
Wenzhi Mao, University of Pittsburgh, United States
Ivet Bahar, University of Pittsburgh, United States

Abstract:
Anyone who has performed conventional molecular dynamics (MD) simulations at least once will have observed that conformers diverge structurally from their starting X-ray crystal structures, sometimes drastically. Several studies have noted such structural differences and attempted to explain this behavior and to justify the need for MD equilibration. Although these studies have provided important insights, a detailed explanation based on fundamental physics, followed by validation on a large ensemble of protein structures, is still missing.

In this study we first provide thermodynamic insight into the radically different thermodynamic conditions of crystallization solutions and the MD simulation environment. Crystallization conditions can involve unphysiologically high ion concentrations, low temperatures, and crystal packing with strong, specific protein-protein interactions that are not present under physiological conditions. These differences affect protein conformations and functions, so equilibrated MD structures are expected to differ from their X-ray structures.

To validate this claim we performed conventional MD simulations for 70 different proteins. The RMSD between the crystal and MD structures ranged from 1.2 to 5.0 Å after 10 ns of simulation and reached up to 14 Å after 100 ns. Our analysis shows that X-ray structures are good starting points but do not perfectly represent physiological conditions. This fact has to be taken into consideration when any kind of computational method, such as docking, is used to guide experimental analysis.
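
The RMSD values quoted above compare a crystal structure with an MD structure. The abstract does not say how the structures were superposed, so the following is only a minimal sketch, assuming the common convention of optimal rigid-body superposition (the Kabsch algorithm) before the RMSD is taken. The function name `kabsch_rmsd` and the coordinates are hypothetical.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition.

    Assumes row i of P and row i of Q describe the same atom.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both structures on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # The SVD of the covariance matrix yields the optimal rotation (Kabsch).
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Flip one axis if needed so the result is a proper rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = (R @ Pc.T).T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# Hypothetical coordinates (in Å) standing in for crystal and MD conformers.
crystal = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0], [0.0, 1.5, 1.0]])
md_frame = crystal + np.array([[0.3, 0.0, 0.0], [0.0, -0.2, 0.0], [0.0, 0.0, 0.4], [0.1, 0.1, 0.0]])
print(f"RMSD after superposition: {kabsch_rmsd(crystal, md_frame):.2f} Å")
```

In practice the atom selection (for example C-alpha atoms only) changes the value considerably, so it should be reported alongside any RMSD.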


top
A57:
AERSMine: A Phenome-Pharmacome Web Datamine based on the FDA’s Adverse Events Reporting System Data

Subject: other

Presenting Author: Mayur Sarangdhar, Cincinnati Children's Hospital Medical Center

Author(s):
Akash Kushwaha, Cincinnati Children's Hospital Medical Center, United States
Anil Jegga, Cincinnati Children's Hospital Medical Center, United States
Bruce Aronow, Cincinnati Children's Hospital Medical Center, United States

Abstract:


top
A58:
Metabolic reconstruction and analysis explain differences in virulence of dominant strains of Protozoan parasite Toxoplasma gondii.

Subject: Biological Networks

Presenting Author: Nirvana Nursimulu, University of Toronto

Author(s):
Carl Song, University of Toronto, Canada
Melissa Chiasson, Laboratory of Parasitic Diseases, NIAID, National Institutes of Health, United States
Nirvana Nursimulu, University of Toronto, Canada
Stacy Hung, University of Toronto, Canada
James Wasmuth, Faculty of Veterinary Medicine, University of Calgary, Canada
Michael Grigg, Laboratory of Parasitic Diseases, NIAID, National Institutes of Health, United States
John Parkinson, University of Toronto, Canada

Abstract:
Toxoplasma gondii is a single-celled parasite believed to chronically infect at least a third of the world’s population. Population genetics studies reveal at least three dominant strains, termed type I, II and III, each characterized by a distinct virulence profile. In addition to expression of key surface antigens, we hypothesize that metabolic potential plays a role in virulence. To explore this hypothesis, we performed a metabolic reconstruction and constraint-based analysis of T. gondii in its proliferative tachyzoite form. Through integration of functional genomic data obtained from different strains, flux balance analysis of our model (termed iCS378) correctly predicts an increased growth rate in the more virulent type I strain relative to type II. The increase in growth rate appears to be related to an increase in energy production via upregulation of the glycolytic, pentose phosphate and TCA cycle pathways. These findings highlight a regulatory route for mediating growth rate plasticity with the potential to impact the parasite’s remarkable ability to infect a broad range of hosts. In silico single- and double-gene knockout experiments predict several reactions with strain-specific sensitivities. Drug assays confirm these predictions, with the type II strain appearing more sensitive than the type I strain to drugs targeting a key enzyme in the glycolytic pathway. By highlighting the influence of metabolic potential on T. gondii virulence, this study demonstrates the need to assess the genetic diversity of infectious agents when considering effective therapeutic interventions.
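
Flux balance analysis, as used above, maximizes a growth (biomass) flux subject to steady-state mass balance and flux bounds. The sketch below illustrates the idea on a hypothetical three-reaction network. It is not the iCS378 model, and the strain-specific bounds are invented numbers standing in for constraints that would be derived from functional genomic data.

```python
import numpy as np
from scipy.optimize import linprog

# Toy network (hypothetical, not iCS378):
#   v0: uptake -> A,  v1: A -> B,  v2: B -> biomass
# Rows are metabolites A and B; columns are reactions v0..v2.
S = np.array([
    [1.0, -1.0, 0.0],   # A: produced by v0, consumed by v1
    [0.0, 1.0, -1.0],   # B: produced by v1, consumed by v2
])

def max_growth(upper_bounds):
    """Maximize biomass flux v2 subject to S v = 0 and 0 <= v <= upper_bounds."""
    c = np.array([0.0, 0.0, -1.0])  # linprog minimizes, so negate the objective
    bounds = [(0.0, ub) for ub in upper_bounds]
    result = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds)
    if not result.success:
        raise RuntimeError(result.message)
    return float(result.x[2])

# Assumed strain-specific capacities for reaction v1 (illustrative values only).
growth_type_one = max_growth([10.0, 8.0, 1000.0])
growth_type_two = max_growth([10.0, 5.0, 1000.0])
print(growth_type_one, growth_type_two)
```

A single-reaction knockout can be simulated in the same framework by setting that reaction's upper bound to zero and re-solving.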


top
A59:
Scoring protein subnetworks in terms of their disease association and network connectivity: How to compare apples and oranges?

Subject: Biological Networks

Presenting Author: Marzieh Ayati, Case Western Reserve University

Author(s):
Sinan Erten, Case Western Reserve University, United States
Mehmet Koyuturk, Case Western Reserve University, United States

Abstract:
Network and pathway-based analyses are commonly used as powerful tools to interpret the findings of genome-wide association studies (GWAS) in a functional context. In particular, identification of disease-associated functional modules, i.e., highly connected protein-protein interaction subnetworks with high aggregate disease association, has been shown to be promising in uncovering the functional relationships among disease-associated genes and proteins. An important issue in this regard is the scoring of subnetworks by integrating two quantities that are not readily compatible: disease association of individual gene products and network connectivity among proteins. Current scoring schemes either disregard the level of connectivity and focus on the aggregate disease association of connected proteins or use a linear combination of these two quantities. However, such scoring schemes may produce arbitrarily large subnetworks that are often not statistically significant, or require tuning of the parameters used to weigh the contributions of network connectivity and disease association.

Here, we propose a parameter-free scoring scheme that aims to score subnetworks by directly incorporating the statistical significance of network connectivity and disease association. We test the proposed scoring scheme on two independent GWAS datasets for two different diseases, type II diabetes and breast cancer. We implement a greedy algorithm to identify modules based on three different scoring approaches. Our results suggest that subnetworks identified by commonly used methods may fail tests of statistical significance after correction for multiple hypothesis testing. In contrast, the proposed scoring scheme yields highly significant subnetworks that are biologically relevant and reproducible across independent cohorts.


top
A60:
Xander: Tool for Gene-targeted Metagenomic Assembly

Subject: Algorithm Development & Machine Learning

Presenting Author: Jordan Fish, Michigan State University

Author(s):
Wang Qiong, Michigan State University, United States
Yanni Sun, Michigan State University, United States
C. Titus Brown, Michigan State University, United States
James Tiedje, Michigan State University, United States
James Cole, Michigan State University, United States

Abstract:
Soil and other very large metagenomes tax the abilities of current-generation short-read assemblers. In addition to space and time complexity issues, most assemblers are not designed to correctly treat reads from closely related populations of organisms. For gene-centric analyses, assembling only the genes of interest can reduce the computational complexity of extracting information from a metagenomic dataset. We introduce Xander, a tool for gene-targeted metagenomic assembly.
Xander uses information about specific genes to guide assembly, and gene annotation occurs concomitantly with assembly. This approach combines a space-efficient de Bruijn graph representation of the reads, stored in a Bloom filter, with a protein profile hidden Markov model (HMM) for the gene(s) of interest. To limit the search, we use a heuristic to first identify nucleotide k-mers that translate to peptides found in a set of representatives of the target protein family. These k-mers, along with the positions of the peptides in the HMM representation, define a set of search start points. Contigs are then assembled by applying graph path-finding algorithms in both directions on the combined de Bruijn-HMM graph structure. Multiple best paths can be found to explore population microheterogeneity.
Using this technique, we have been able to extract complete nifH and rplB protein coding regions from several 50G soil metagenomes, including metagenomes from an Iowa great prairie soil and soils planted with Miscanthus and switchgrass, two potential biofuel crops.
Future work will focus on separating sequencing artifacts from low-coverage rare populations.
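
The space-efficient graph representation mentioned above can be illustrated with a Bloom filter that stores k-mers and answers neighbor queries, so that de Bruijn graph edges stay implicit. This is a minimal sketch of that general idea and not Xander's implementation. The class and function names, the filter size, and the hashing scheme are all assumptions, and the HMM-guided search is not shown.

```python
import hashlib

class KmerBloomFilter:
    """Set-like structure with no false negatives and a small false-positive rate."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, kmer):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer))

def build_filter(reads, k):
    """Insert every k-mer of every read into a fresh Bloom filter."""
    bloom = KmerBloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])
    return bloom

def successors(bloom, kmer):
    """Implicit de Bruijn edges: one-base extensions whose k-mer is in the filter."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]

bloom = build_filter(["ACGTACGTGA"], k=4)
print(successors(bloom, "ACGT"))
```

Because a Bloom filter can report false positives, a path-finding step built on it has to tolerate occasional spurious edges.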


top
A61:
Matrix Algorithms for Genome Evolution (MAGE)

Subject: Algorithm Development & Machine Learning

Presenting Author: Shuhao Qiu, University of Toledo

Author(s):
Andrew McSweeny, The University of Toledo, United States
Arnab Saha Mandal, University of Toledo, United States
Samuel Choulet, University of Toledo, United States
Alexei Fedorov, The University of Toledo, United States

Abstract:
We performed computational simulations modeling the influx of 50 novel point mutations per individual (the real rate observed in the human genome) in order to determine the genetic parameters most crucial for maintaining population fitness. Various distributions of mutations by their selection coefficients (ranging from 100% neutral to 100% non-neutral models) were investigated. Under neutral settings (no selection), our results were consonant with Kimura’s Neutral Theory, wherein K / µ = 1 (K being the number of substitutions per generation, µ being the number of mutations per gamete). However, even the presence of a minor pool of non-neutral mutations often resulted in a significant deviation of the K / µ ratio from 1. We found that the K / µ ratio depended on the following parameters: 1) the mutation rate; 2) the number of children per couple; 3) the number of meiotic recombination events during formation of gametes; 4) the dominance mode, which determines how fit the heterozygotes are in relation to the homozygotes. Our results show that the K / µ ratio varied from as large as 3.6 to as small as 0.4 under realistic conditions for the human population. We also found that, as long as the mutation rate is small, fitness does not decline over time. After the introduction of more than 5 mutations per gamete, the fitness of subsequent generations starts to decline in proportion to the influx of mutations. This decline in fitness can be averted by increasing the number of meiotic recombinations in gametes.
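
The neutral expectation K / µ = 1 cited above follows from a short standard argument, sketched here for a diploid population. The population size N is a symbol introduced for this illustration; it does not appear in the abstract.

```latex
% N diploid individuals carry 2N gametes, each with \mu new mutations per generation.
\text{new mutations per generation} = 2N\mu
% A new neutral mutation starts at frequency 1/(2N), which is also its fixation probability.
P_{\text{fix}} = \frac{1}{2N}
% Substitutions per generation are the product of the two, so N cancels.
K = 2N\mu \cdot \frac{1}{2N} = \mu
\quad\Longrightarrow\quad \frac{K}{\mu} = 1
```

Selection changes the fixation probability away from 1/(2N), which is why even a minor pool of non-neutral mutations can move the ratio away from 1.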


top
A62:
Genetic Backpropagation: A New Method for the Identification of Differentially Expressed Genes in a Microarray Data Set

Subject: Algorithm Development & Machine Learning

Presenting Author: Bryan Herman, University of Akron

Abstract:


top
B01:
From data towards knowledge: Revealing the architecture of signaling systems by unifying knowledge mining and data mining of systematic perturbation data

Subject: Algorithm Development & Machine Learning

Presenting Author: Songjian Lu, University of Pittsburgh

Author(s):
Xinghua Lu, University of Pittsburgh, United States
Bo Jin, University of Pittsburgh, United States
Ashley Cowart, Medical University of South Carolina, United States

Abstract:
Genetic and pharmacological perturbation experiments, such as deleting a gene and monitoring gene expression responses, are powerful tools for studying cellular signal transduction pathways. However, it remains a challenge to automatically derive knowledge of a cellular signaling system at a conceptual level from systematic perturbation-response data. In this study, we explored a framework that unifies knowledge mining and data mining approaches towards this goal. The framework consists of the following automated processes: 1) applying an ontology-driven knowledge mining approach to identify functional modules among the genes responding to a perturbation in order to reveal potential signals affected by the perturbation; 2) applying a graph-based data mining approach to search for perturbations that affect a common signal with respect to a functional module; and 3) revealing the architecture of a signaling system by organizing signaling units into a hierarchy based on their relationships. Applying this framework to a compendium of yeast perturbation-response data, we have successfully recovered many well-known signal transduction pathways; in addition, our analysis has led to many hypotheses regarding the yeast signal transduction system; finally, our analysis automatically organized perturbed genes as a graph reflecting the architecture of the yeast signaling system. Importantly, this framework transformed molecular findings from a gene level to a conceptual level, which can readily be translated into computable knowledge in the form of rules regarding the yeast signaling system, such as “if genes involved in MAPK signaling are perturbed, genes involved in pheromone responses will be differentially expressed”.


top
B02:
Transfer Learning of Classification Rules through Functional Modules

Subject: Algorithm Development & Machine Learning

Presenting Author: Henry Ogoe, University of Pittsburgh

Author(s):
Xinghua Lu, University of Pittsburgh, United States
Vanathi Gopalakrishnan, University of Pittsburgh, United States

Abstract:
Recent studies have shown that there exist many-to-many relationships between certain diseases and their causative genes, which are functionally related. These functionally related genes, which we refer to as functional modules (FMs), work in tandem to drive biochemical pathways or processes. Several studies have linked dysfunctional FMs to some diseases. Revealing FMs within a set of arbitrary genes could help researchers gain insight into the etiology of diseases. In addition, FMs could serve as catalysts for abstracting rules for knowledge transfer between different but related learning tasks. We conducted an exploratory study on a method to identify FMs using spectral clustering in combination with the Gene Ontology. Using genes extracted from some metabolic pathways of Saccharomyces cerevisiae, we identified FMs that were associated with stress response, cell aging, maintenance of stability, and a host of others. In addition, our method revealed pleiotropic genes such as ZWF1 and a co-occurrence of several families of genes in some FMs. These results support claims by other studies that, for most complex diseases caused by genes, FMs exist. We have also applied our method of discovering FMs to map the relatedness between a source and target domain for transfer learning of classification rules for biomarker discovery. We tested this concept on lung adenocarcinoma datasets to generate classification rules for prognostic markers and the results were promising. Subject to further refinement, our method could develop into a framework for generating mapping functions that could facilitate transfer learning between related but different biological datasets.


top
B03:
Cohort Analyzer for Large Scale Non-Mendelian Disease Studies

Subject: Algorithm Development & Machine Learning

Presenting Author: Andrew Plassard, Cincinnati Children's Hospital

Abstract:
The purpose of this research is to develop a novel method for the analysis of genetic mutations that are present in individual patients presenting with non-Mendelian phenotypes. This approach uses feature impact scoring for specific protein domains and machine learning approaches built on a training set that contains common mutations seen in dbSNP as negative examples and highly damaging mutations in OMIM as positive examples. High-scoring mutations are then analyzed further to determine their potential role in a phenotype or disease state.


top
B04:
TEAK: Topology Enrichment Analysis frameworK for detecting activated biological subpathways

Subject: Bioimage Analysis

Presenting Author: Thair Judeh, Wayne State University

Author(s):
Cole Johnson, University of Michigan, United States
Anuj Kumar, University of Michigan, United States
Dongxiao Zhu, Wayne State University, United States

Abstract:
To mine gene and compound expression data sets effectively, analysis frameworks need to incorporate methods that identify intergenic relationships within enriched biologically activated subpathways. For this purpose, the Topology Enrichment Analysis frameworK (TEAK) exploits subpathways and their topologies. Using a novel in-house algorithm and a tailor-made Clique Percolation Method, TEAK extracts linear and nonlinear KEGG subpathways, respectively. TEAK supports both context-specific data, i.e., a single data matrix consisting of relevant mutant samples or time-series data, and case-control data, i.e., two data matrices. For each subpathway, TEAK fits a single Gaussian Bayesian network for context-specific data and two Gaussian Bayesian networks for case-control data. For scoring, TEAK uses the Bayesian Information Criterion for context-specific data and the Kullback-Leibler divergence for case-control data. Using experimental studies in conjunction with TEAK, microarray data sets profiling stress responses in the model eukaryote Saccharomyces cerevisiae were analyzed. For a public microarray data set, TEAK identified a set of linear sphingolipid metabolic subpathways activated during the yeast response to nitrogen stress, and phenotypic analyses of the corresponding deletion strains have indicated previously unreported fitness defects for the dpl1Δ and lag1Δ mutants under conditions of nitrogen limitation. In addition, the yeast filamentous response to nitrogen stress was studied by profiling changes in transcript levels upon deletion of two key filamentous growth transcription factors, FLO8 and MSS11. TEAK identified a nonlinear glycerophospholipid metabolism subpathway involving the SLC1 gene, which was found via mutational analysis to be required for yeast filamentous growth.
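
A Gaussian Bayesian network defines a joint multivariate Gaussian distribution, so the Kullback-Leibler divergence between a case network and a control network has a closed form. The sketch below evaluates that closed form from mean vectors and covariance matrices. Whether TEAK computes its case-control score in exactly this way is not stated in the abstract, and the function name and example numbers are hypothetical.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) for multivariate Gaussians, in nats."""
    mu0 = np.atleast_1d(np.asarray(mu0, dtype=float))
    mu1 = np.atleast_1d(np.asarray(mu1, dtype=float))
    cov0 = np.atleast_2d(np.asarray(cov0, dtype=float))
    cov1 = np.atleast_2d(np.asarray(cov1, dtype=float))
    k = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    trace_term = np.trace(cov1_inv @ cov0)
    mahalanobis_term = diff @ cov1_inv @ diff
    # slogdet is numerically safer than log(det(...)) for larger matrices.
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return float(0.5 * (trace_term + mahalanobis_term - k + logdet1 - logdet0))

# Hypothetical two-gene subpathway: the case condition shifts the mean and the correlation.
control_mean, control_cov = [0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]]
case_mean, case_cov = [1.0, 0.5], [[1.0, 0.7], [0.7, 1.0]]
print(gaussian_kl(case_mean, case_cov, control_mean, control_cov))
```

The divergence is not symmetric, so a score built on it has to fix a direction or combine both directions.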


top
B05:
Distinct Signaling Roles of Ceramide Species in Yeast Revealed Through Systematic Perturbation and Integromics Analyses

Subject: Biological Networks

Presenting Author: Lujia Chen, University of Pittsburgh

Abstract:
Ceramide, the central molecule of sphingolipid metabolism, is becoming recognized as an important second messenger with implications for disease and for the fundamental study of cell signaling. One of the most pressing challenges in deciphering ceramide signaling emanates from the myriad of ceramide species known to exist and the possibility that many of them may have distinct functions. Here, using yeast as a model system, we designed and applied innovative systems biology approaches to perturb ceramide metabolism, which enabled us to infer causal relationships between ceramide species and their potential targets by combining lipidomic, genomic, and transcriptomic analyses. This approach revealed that different ceramide groups are regulated by distinct metabolic mechanisms during heat stress and that distinct ceramide species participate in disparate signaling pathways. These results indicate a previously unrecognized complexity and versatility of lipid-mediated cell regulation.


top
B06:
Visualization strategies for large-scale multivariate genomic information

Subject: Biological Networks

Presenting Author: Darya Filippova, CMU

Author(s):
Rob Patro, CMU, United States
Geet Duggal, CMU, United States
Hao Wang, CMU, United States
Emre Sefer, CMU, United States
Carl Kingsford, CMU, United States

Abstract:
With the advent of high-throughput sequencing, a single species can translate to gigabytes of data describing its genome sequence, gene expression across different tissues and conditions, gene regulatory and protein interaction networks, genome variations from the reference sequence, as well as data on the spatial proximity of genomic regions in the folded state in the nucleus. UCSC Genome Browser tracks have long been the de facto standard for viewing these multiple variables mapped onto genomic coordinates. However, genomic tracks have drawn criticism for their lack of flexibility when integrating new relational variables (gene regulation or spatial proximity data): relationships between genomic regions are hard, if not impossible, to display using a linear layout, because gene regulation, protein interaction, and 3D spatial co-location require a non-linear approach. Additionally, simple tasks such as a chromosome or genome overview, as well as comparison between multiple regions, are not easily performed with linear tracks. Circular layouts like Circos solve the problem of displaying relations but suffer from information clutter: individual relations are impossible to follow amid thousands of intersecting and overlapping lines. We explore a graph approach to displaying genomic information where the primary visual relationship is spatial co-location. We enhance the visualization by mapping genomic annotations onto the graph's nodes and edges and use a novel layout that avoids clutter by respecting node size. We show that this approach allows users to explore genomes as a whole and encourages hypothesis generation.


top
B07:
ANALYSIS OF MULTIGENE DISORDERS USING THE LYNX SYSTEM

Subject: Biological Networks

Presenting Author: Sandhya Balasubramanian, The University of Chicago

Author(s):
Natalia Maltsev, The University of Chicago, United States
Conrad Gilliam, The University of Chicago, United States
Dina Sulakhe, The Computation Institute, United States
Paul Dave, The Computation Institute, United States
Bingquin Xie, The Illinois Institute of Technology, United States
Bo Feng, The Illinois Institute of Technology, United States
Eduardo Berrocal, The Illinois Institute of Technology, United States
Andrew Taylor, The University of Chicago, United States
Daniela Boernigen, Harvard School of Public Health, United States

Abstract:
Keywords: translational medicine, biological networks, gene prioritization

Progress in understanding the molecular mechanisms underlying complex heritable disorders (e.g., autism, schizophrenia, diabetes) depends on new bioinformatics approaches for systems-level analysis and identification of disease-specific patterns of inheritance.
We present an approach and a supporting computational platform, LYNX (http://lynx.ci.uchicago.edu/), for the analysis of complex heritable disorders from the systems biology perspective. Our approach is based on large-scale integration of genomic and clinical data and various classes of biological information from over 35 public and private databases. These data are used to identify genes and molecular networks contributing to phenotypes of interest, as well as to predict additional high-confidence disease genes to be tested experimentally. Our analytical strategy includes: (a) enrichment analysis of high-throughput genomic data; (b) feature-based gene prioritization; and (c) the development of network-based disease models for the identification of molecular mechanisms involved in disease pathogenesis. Network-based gene prioritization leverages previous work of Dr. Börnigen-Nitsch on the PINTA system. The algorithms were modified to accommodate a variety of weighted data types for gene prioritization. We will illustrate the utility of our approach using the prediction of the molecular mechanisms underlying brain connectivity disorders (e.g., agenesis of the corpus callosum, autism) and other neurodevelopmental disorders as examples. This knowledge will eventually lead to the development of efficient diagnostic and therapeutic strategies.


top
B08:
Integrative Phenotyping Framework (iPF): Integration of Multiple Omic Data Identifies Novel Disease Phenotypes via integrative clustering

Subject: Algorithm Development & Machine Learning

Presenting Author: SungHwan Kim, University of Pittsburgh

Author(s):
Dongwan Kang, Lawrence Berkeley National Laboratory, United States
George C. Tseng, University of Pittsburgh, United States
Naftali Kaminski, University of Pittsburgh, United States

Abstract:
Disease phenotyping is entering a new era, now that the extensive phenotypic information made available by technological advances allows accurate description and diagnosis of diseases. Newly discovered phenotypes can identify novel subtypes of a disease and precisely subdivide a current diagnosis, enabling tailored treatments and personalized medicine. In this research we introduce a general integrative phenotyping framework (iPF) based on integrative clustering: iPF performs ‘feature fusion’ and ‘feature relocation’ by multi-dimensional scaling (MDS) to merge multiple datasets. Under the derived relocations, we smooth feature intensities via a generalized additive model (GAM) with spline tensor products and apply unsupervised clustering to identify underlying subtype structures based on each omics dataset as well as on all combined datasets. To verify the validity of iPF, we obtained 669 clinical variables and 15,966 gene expression measurements from 471 patients who were initially diagnosed with chronic obstructive lung disease (COPD) or interstitial lung disease (ILD). Without being restricted by the given clinical diagnosis, iPF identified 3 reproducible major groups of patients characterized by distinct phenotypes with respect to the two diseases (ILD and COPD). In particular, we observed different combinations of the distinctive phenotypes in the intermediate groups, which suggests that COPD and ILD may not be phenotypic extremes but may instead involve multiple overlapping syndromes as well as diverging mechanisms. Furthermore, iPF can serve to elucidate the connection between the distribution of subjects and informative phenotypic features when applied to integrated omics datasets.


top
B09:
Computational Studies of Drug Resistance Mechanism and Dynamic Properties of Influenza Virus

Subject: Disease Models & Molecular Medicine

Presenting Author: Nanyu Han, Nanyang Technological University

Author(s):
Yuguang Mu, Nanyang Technological University, Singapore

Abstract:
It is critical to understand the molecular basis of drug resistance in influenza viruses in order to treat this infectious disease efficiently. Recently, H1N1 strains of influenza A carrying a Q136K mutation in neuraminidase were reported. This new strain strongly neutralized the effect of Zanamivir. In this study, conventional molecular dynamics simulations and metadynamics simulations were employed to explore the mechanism of Zanamivir resistance. Hydrogen-bond network analysis showed weakened interaction between Zanamivir and E276/D151 on account of the electrostatic interaction between K136 and D151. Metadynamics simulations showed that the free energy landscape in the mutant is different from that in the wild-type neuraminidase, suggesting weaker binding. This study indicates that deformation of the 150-loop, together with the resulting altered hydrogen-bond network, is a potential cause of the development of Zanamivir resistance.
In addition, Hamiltonian replica exchange molecular dynamics simulations were performed to explore the plasticity of the 150-loop, whose importance was shown in the drug resistance study above. The free energy landscape of the 150-loop was extensively explored, and the most dynamic motif was identified. These enhanced sampling simulations, together with the discovery of the drug resistance mechanism, provide more information for further structure-based drug discovery against influenza virus.


top
B10:
Modeling and synthesis of 3D cellular and nuclear shape using a shape space learned from fluorescent microscopic images

Subject: Bioimage Analysis

Presenting Author: Taráz Buck, Carnegie Mellon University and University of Pittsburgh

Author(s):
Gustavo Rohde, Carnegie Mellon University, United States
Robert Murphy, Carnegie Mellon University, United States
Wei Wang, Beijing Jiaotong University, China

Abstract:
Cell shape is an important indicator and consequence of cell state. Position in the cell cycle, state of differentiation, disease, drug treatment, and other variables and perturbations often cause marked changes in morphology. Methods developed by our group and others can effectively represent various classes of cell shape extracted from microscopic images, but a particular method can usually only encode a restricted class of shapes and cannot fully capture the shape of a neuron or even of some fibroblasts. Here we expand a nonparametric framework for learning a generative model of single two-dimensional shapes to represent and synthesize nested three-dimensional cell and nuclear shapes. This extension enables simultaneous modeling of both shapes and their covariation in a single low-dimensional shape space. We use this framework to capture major modes of shape variation and to explore possible trajectories within the shape space for cells over time.


top
B11:
From TraSH to Treasure: A Novel Pipeline for Transposon Site Hybridization Analysis

Subject: Algorithm Development & Machine Learning

Presenting Author: William Harvey, The Research Institute at Nationwide Children's Hospital

Author(s):
David Newsom, The Research Institute at Nationwide Children's Hospital, United States
Mohamed Ali, The Ohio State University, United States
Brian M. M. Ahmer, The Ohio State University, United States
Peter White, The Research Institute at Nationwide Children's Hospital, United States

Abstract:
Understanding the genetic components involved in bacterial survival and replication under different selective conditions is a fundamental problem with significant implications in microbial pathogenesis. Recent techniques have emerged to address this problem in which populations of millions of mutants are screened simultaneously for fitness defects. For example, pools of mutants can be grown in different conditions, and selective pressures reveal the essentiality and fitness of genes via the conservation or loss of mutants from the pool.

One such approach for screening mutants in parallel is transposon site hybridization (TraSH), in which microarrays are used to measure the proportion of transposon insertion mutants in each gene under different conditions. However, the resolution and quality of microarrays coupled with the need for careful data normalization and statistical analysis can prove challenging. Alternatively, the use of next-generation sequencing to measure the proportion of transposon insertion mutants in a population under different conditions (Tn-seq) achieves a much more precise picture of transposon insertion locations.

We present an optimized TraSH data analysis pipeline that fully automates discovery of genes related to bacterial fitness. Traditional microarray normalization algorithms and other tools for identifying insertion events are evaluated against Tn-seq truth data to achieve optimal classification performance. Additionally, we present a novel microarray data normalization technique based on the Laplace-Beltrami operator. Through analysis of both TraSH and Tn-seq data we have identified a promising approach for identifying genes that contribute to bacterial fitness.


top
B12:
Phylogenetic Models of Tumor Progression from Fluorescence in Situ Hybridization (FISH) Data on Many Single Cells of a Solid Tumor

Subject: Algorithm Development & Machine Learning

Presenting Author: Salim Akhter Chowdhury, Carnegie Mellon University

Author(s):
Alejandro Schäffer, National Institutes of Health, United States
Stanley Shackney, Intelligent Oncotherapeutics, United States
Darawalee Wangsa, National Institutes of Health, United States
Kerstin Heselmeyer-Haddad, National Institutes of Health, United States
Thomas Ried, National Institutes of Health, United States
Russell Schwartz, Carnegie Mellon University, United States

Abstract:
Characterizing the common pathways through which tumors progress is critical to understanding the molecular basis of cancer and developing effective treatments. Algorithms for phylogenetics, i.e., evolutionary tree building, can be used to infer progression pathways of single tumors when there is widespread intra-tumor heterogeneity from cell to cell. A useful source of data for studying the likely progression of individual tumors is fluorescence in situ hybridization (FISH), which allows one to count copy numbers of several genes in hundreds of single cells. We describe computational methods to compute likely evolutionary histories from single-cell copy number data that build on a model of tumor phylogenetics as a form of the rectilinear minimum Steiner tree problem. Because of the intractability of the problem, we develop both worst-case exponential exact algorithms and an efficient heuristic. The resulting trees can then provide insights into likely tumor progression pathways as well as a source of features for use in predicting clinically significant properties of specific tumors. We apply the methods to simulated data and a selection of FISH copy number data sets derived from various cancer types. The resulting trees lead to inferences about likely tumor progression pathways consistent with the prior literature as well as improved prediction accuracy relative to unstructured gene copy number data.
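
In the rectilinear model described above, each cell is a vector of gene copy numbers and the cost of an edge is the L1 (rectilinear) distance between two vectors. The sketch below is not the authors' exact algorithm or heuristic. It swaps in a plain minimum spanning tree over the observed profiles (Prim's algorithm), which adds no unobserved intermediate states and therefore only gives an upper bound on the weight of the rectilinear minimum Steiner tree. The copy number profiles are hypothetical.

```python
def rectilinear_distance(a, b):
    """L1 distance: total number of single-copy gains or losses between profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minimum_spanning_tree(profiles):
    """Prim's algorithm on the complete graph of profiles.

    Returns (total_weight, edges), where each edge is (parent, child, distance)
    and nodes are indices into profiles. The tree is grown from profile 0.
    """
    n = len(profiles)
    if n == 0:
        return 0, []
    # best[j] = (cheapest distance from node j to the current tree, tree node achieving it)
    best = {j: (rectilinear_distance(profiles[0], profiles[j]), 0) for j in range(1, n)}
    edges = []
    total = 0
    while best:
        j = min(best, key=lambda node: best[node][0])
        dist, parent = best.pop(j)
        edges.append((parent, j, dist))
        total += dist
        # Newly added node j may offer a cheaper attachment for the remaining nodes.
        for other in best:
            d = rectilinear_distance(profiles[j], profiles[other])
            if d < best[other][0]:
                best[other] = (d, j)
    return total, edges

# Hypothetical copy numbers of two genes in four distinct cell states.
profiles = [(2, 2), (3, 2), (2, 3), (4, 4)]
weight, edges = minimum_spanning_tree(profiles)
print(weight, edges)
```

Rooting the tree at the normal diploid state, here (2, 2), lets the edges be read as putative gain and loss events along progression pathways.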


B13:
MDAsim: a Multiple Displacement Amplification Simulator

Subject: Algorithm Development & Machine Learning

Presenting Author: Zeinab Taghavi, Wayne State University

Abstract:
Multiple displacement amplification (MDA) is a fast, non-PCR-based isothermal DNA amplification method that can amplify small amounts of DNA to a reasonable quantity for genomic analysis. This technique is suitable for metagenomics, single-cell genome sequencing and related analyses. Unlike multicell amplification, the coverage distribution of the output amplicons for single-cell MDA is not uniform. This distribution is unknown, and the parameters that affect this amplification bias have not been studied thoroughly. To better understand the MDA reaction, we have developed a simplified mathematical model and a corresponding simulation algorithm, called MDAsim, to obtain a generative model for the output amplicons. We show that the output coverage of MDAsim matches the experimental coverage very well. Therefore, the combination of MDAsim and a sequencing simulator is useful for testing and evaluating single-cell assemblers while avoiding the burden of experimental sequencing. Our results suggest that modelling the MDA mechanism and simulating such increasingly complex models may provide valuable insight into the MDA process, which in turn can be used to design more efficient MDA reactions.


B14:
SPATA: A Seeding and Patching Algorithm for Hybrid Transcriptome Assembly

Subject: Algorithm Development & Machine Learning

Presenting Author: Tin Nguyen, Wayne State University

Author(s):
Zhiyu Zhao, UT Southwestern Medical Center, United States
Dongxiao Zhu, Wayne State University, United States

Abstract:
Transcriptome assembly from RNA-Seq reads is an active area of bioinformatics research. The ever-declining cost and the increasing depth of RNA-Seq data have provided unprecedented opportunities to better assemble condition-specific transcriptomes. However, the nonlinear transcript structures and the ultra-high throughput of RNA-Seq reads pose significant algorithmic and computational challenges to transcriptome assembly approaches, whether reference-guided or de novo. While reference-guided approaches offer good sensitivity, they rely on the alignment results of splice-aware aligners and are thus unsuitable for species with incomplete reference genomes. In contrast, de novo approaches do not depend on the reference genome but face a computationally daunting task due to the complexity of the graph built for the whole transcriptome. In response to these challenges, we present SPATA, a hybrid approach that exploits an incomplete reference genome without relying on splice-aware aligners. We have designed a split-and-align procedure to efficiently localize the reads to individual genomic loci, followed by an accurate de novo assembly of the reads falling in each locus. Using extensive simulation studies, we demonstrate high accuracy and precision in transcriptome reconstruction in comparison with selected transcriptome assembly tools. Furthermore, SPATA outperforms the competing methods when an array of reference genomes of various qualities is used. Our method is implemented as a new module of SAMMate, a GUI software suite freely available at http://sammate.sourceforge.net.


B15:
HMPL: A pipeline for identifying hemimethylation patterns

Subject: Algorithm Development & Machine Learning

Presenting Author: Shuying Sun, Case Western Reserve University

Author(s):
Peng Li, Case Western Reserve University, United States

Abstract:
DNA methylation is the addition of a methyl group to the 5-carbon position of cytosine. If methylation occurs on only one strand at a specific cytosine site, it is called hemimethylation. Recent studies show that although hemimethylated sites are found in both cancers and controls, hemimethylated sites in cancer cells tend to occur in clusters. In addition, several studies have found polarity hemimethylation patterns in individual genes. For example, at one CG site the methylation pattern is M (methylated) on the positive strand and U (unmethylated) on the negative strand, while the next adjacent CG site has the reversed hemimethylation pattern, i.e., U and M on the positive and negative strands, respectively. These polarity hemimethylation patterns reveal the possibility of de novo methylation and demethylation, since it is unlikely that the two hemimethylated CG sites have both been produced by the failure of methylation maintenance. All these clustering and polarity patterns were reported in single genes. It is unclear whether the hemimethylation cluster patterns are seen genome-wide. It is also unclear whether there is a significant difference between cancerous and non-cancerous hemimethylation patterns. To address these questions, we have developed a pipeline, named HMPL (HemiMethylation PipeLine), to identify hemimethylated sites and clusters in breast cancer genomes. Our pipeline uses Perl and R scripts to align raw bisulfite sequencing reads, parse the alignment output, and summarize the final results. It identifies hemimethylation patterns in a single sample and compares two different samples (e.g., cancerous vs. normal individuals).


B16:
SASeq: A Selective and Adaptive Shrinkage Approach to Identify and Quantify Condition-Specific Transcripts using RNA-Seq

Subject: Algorithm Development & Machine Learning

Presenting Author: Tin Nguyen, Wayne State University

Author(s):
Nan Deng, Wayne State University, United States
Dongxiao Zhu, Wayne State University, United States

Abstract:
Identification and quantification of condition-specific transcripts using RNA-Seq is vital in transcriptomics research. While initial efforts using mathematical or statistical modeling of read counts or per-base exonic signal have been successful, they may suffer from model overfitting since not all the reference transcripts in a database are expressed under a specific biological condition. Standard shrinkage approaches, such as Lasso, shrink all the transcript abundances toward zero in a non-discriminative manner. Thus they do not necessarily yield the set of condition-specific transcripts. Informed shrinkage approaches, using the observed exonic coverage signal, are thus desirable. Motivated by the ubiquitous uncovered exonic regions in RNA-Seq data, termed “uncovered exons”, we propose a new computational approach that first filters out the reference transcripts not supported by splicing and paired-end reads, and then fits a new mathematical model of the per-base exonic coverage signal and the underlying transcript structure. We introduce a tuning parameter to penalize the specific regions of the selected transcripts that were not supported by the uncovered exons. Our approach compares favorably with selected competing methods in terms of both time complexity and accuracy on simulated and real-world data. Our method is implemented in SAMMate, a GUI software suite freely available from http://sammate.sourceforge.net.


B17:
Gauging The Performance of Classifiers in the Context of Tumor Associated SNPs.

Subject: Algorithm Development & Machine Learning

Presenting Author: Hema Kasisomayajula, University of South Carolina

Abstract:
Naive Bayes classifiers (NBCs) and Random Forest classifiers (RFCs) are the preferred tools for predicting the lethality of mutations, given their speed and the option of unsupervised learning methods. Data associated with single nucleotide polymorphisms (SNPs) are increasing by gigabytes, and we used two classifiers to see which one comes closer to validated results. We then used the score to gauge lethality and to see whether lethality could be used to distinguish between passenger and driver SNPs. We used SNPs associated with tumors in model organisms to see whether they would be the same in human oncogenes and tumor suppressors. We used tumor suppressors, oncogenes, and certain soluble kinases (serine/threonine kinases) with validated lethal and non-lethal SNPs as our control groups. Hormone pathways such as insulin pathways are highly conserved in mammals, and we wanted to test the efficacy of RFCs and NBCs in predicting the lethality of simulated nucleotide variants. In all these scenarios, we conclude that the only difference between the classifiers is speed. There is a high percentage overlap between the two predictors.


B18:
XLPM: X-linked Peptide Mapping Algorithm

Subject: Algorithm Development & Machine Learning

Presenting Author: Mihir Jaiswal, University of Arkansas at Little Rock

Author(s):
Nathan Crabtree, University of Arkansas at Little Rock, United States
Roger Hall, Dajali Informatics, United States
Boris Zybaylov, University of Arkansas for Medical Sciences, United States

Abstract:
Mass spectrometry is a promising tool for protein-protein interaction analysis using chemical cross-linking. A better algorithm was needed for data analysis, one that adds support for longer sequences, file formats other than .pkl, more than two sequences, semi-specific and non-specific cross-linkers, forced missed cleavage, tandem MS data and O18 radio-labeled peptide spectra. The XLPM algorithm was designed in Perl and MySQL to incorporate the above-mentioned additional functionalities. The new algorithm takes into account the amino acid sequences of the proteins, the digesting enzyme, the chemical cross-linker and the missed cleavage level. In addition, XLPM can handle any number of static as well as variable modifications. It is also considerably faster. It identifies cross-linking sites, performs in silico digestion, calculates the mass of each combination of fragments with the cross-linker, filters them, and matches theoretical masses with experimental masses to identify cross-linked peptides. It also matches the tandem MS fragmentation profile to find the correct cross-linked species associated with the corresponding precursor mass peak. XLPM can identify cross-linked peptides from mass spectrometry data quickly and efficiently based on the hypothesis that “Detection of a particular b ion with the charge less than that of a precursor implies high probability of detection of the complementary y-ion with remaining charge”. Thus, XLPM can serve as the first step of protein-protein interaction analysis. XLPM will be made available as a free online tool as well as a standalone tool.
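
The in silico digestion step mentioned in this abstract can be sketched in Python as follows. This is a minimal illustration, not XLPM's Perl/MySQL implementation: it assumes a trypsin-like rule (cleave after K or R unless the next residue is P), which the abstract does not specify, and the function names and example sequence are hypothetical.

```python
def cleavage_sites(seq):
    """Return cut positions for an assumed trypsin-like rule.

    A cut is placed after K or R unless the following residue is P.
    Positions 0 and len(seq) are always included as the outer boundaries.
    """
    sites = [0]
    for i, residue in enumerate(seq[:-1]):
        if residue in "KR" and seq[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(seq))
    return sites


def digest(seq, missed_cleavages=0):
    """List peptides produced with up to `missed_cleavages` skipped cuts."""
    if not seq:
        return []
    sites = cleavage_sites(seq)
    peptides = []
    for start in range(len(sites) - 1):
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end < len(sites):
                peptides.append(seq[sites[start]:sites[end]])
    return peptides


# Hypothetical toy sequence; the R-P bond is not cleaved under this rule.
fully_cleaved = digest("AKRPGK")
with_one_missed = digest("AKRPGK", missed_cleavages=1)
```

A cross-link search would then pair such peptides and add the linker mass to each pair before matching against experimental masses; that step needs residue and linker mass tables and is left out here.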


B19:
SCOPE++: Sequence Classification of homoPolymer Emissions

Subject: Algorithm Development & Machine Learning

Presenting Author: James Morton, Miami University

Author(s):
John Karro, Miami University, United States
Chun Liang, Miami University, United States

Abstract:
mRNA polyadenylation, the addition of a poly(A) tail to the 3’-end of pre-mRNA, is a process critical to gene expression and regulation in eukaryotes. To understand the molecular mechanisms governing polyadenylation and other relevant biological processes, it is important to identify these poly(A) tails accurately in transcriptome sequencing data. As post-transcriptional products, these poly(A) tails need to be accurately identified and trimmed for downstream analysis. However, the annotation of these tails is complicated by the presence of post-transcriptional modifications and sequencing errors. Conventional seed-and-extend algorithms struggle to accurately identify these poly(A) tails. To address this problem we have created SCOPE++. Based on a Hidden Markov Model (HMM) approach, SCOPE++ was developed to accurately identify homopolymers in error-prone mRNA sequence reads. Further, SCOPE++ is capable of tailoring its computational model to the characteristics of the dataset through machine learning techniques. In a series of both human-validated and simulation-based tests, we demonstrate that our tool can identify poly(A) tails with near-perfect accuracy at the speed needed for high-throughput applications, providing a valuable resource for polyadenylation research.
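
To make the HMM idea concrete, here is a minimal Python sketch of Viterbi decoding with a two-state model: a background state "B" and a poly(A) state "P". It is not the SCOPE++ model; the two-state topology, the emission probability p_a, the self-transition probability stay, and the function name are all assumptions chosen for illustration, and no parameter training is shown.

```python
import math

STATES = ("B", "P")  # B = background sequence, P = poly(A) tail


def viterbi_polya(seq, p_a=0.9, stay=0.9):
    """Label each base of `seq` as background (B) or poly(A) (P).

    Assumed toy parameters: background emits every base with probability
    0.25; the poly(A) state emits A with probability p_a and spreads the
    rest over the other three bases; each state persists with probability
    `stay`. Both states are equally likely at the first position.
    """
    def emit(state, base):
        if state == "B":
            return 0.25
        return p_a if base == "A" else (1.0 - p_a) / 3.0

    def trans(prev, cur):
        return stay if prev == cur else 1.0 - stay

    if not seq:
        return ""
    score = {s: math.log(0.5) + math.log(emit(s, seq[0])) for s in STATES}
    back = []  # back[t][state] = best predecessor of `state` at step t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(
                STATES, key=lambda prev: score[prev] + math.log(trans(prev, cur))
            )
            new_score[cur] = (
                score[best_prev]
                + math.log(trans(best_prev, cur))
                + math.log(emit(cur, base))
            )
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


# A tail containing one non-A base (for example a sequencing error).
labels = viterbi_polya("CGTCGT" + "AAAACAAAA")
```

With these toy parameters the single C inside the tail stays labeled P, because leaving and re-entering the poly(A) state costs more than one unlikely emission. That tolerance to isolated errors is the practical advantage of a probabilistic model over exact seed matching.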


B20:
The impact of spatial proximity on eQTL associations

Subject: Biological Networks

Presenting Author: Geet Duggal, Carnegie Mellon University

Author(s):
Hao Wang, Carnegie Mellon University, United States
Carl Kingsford, Carnegie Mellon University, United States

Abstract:
eQTLs are SNP-gene pairs where the change in a gene's expression is statistically correlated with a SNP. When SNPs are in regulatory regions near genes, the mechanism by which the mutation affects gene expression could be explained by genomic proximity. However, more than 80% of SNPs in a recent collection of eQTLs are at least 50kbp away from their correlated gene (distal eQTLs). For these eQTLs, the mechanism by which the mutation modulates gene expression is less clear. One plausible explanation for this association is the spatial looping of DNA, where far-away regulatory elements loop onto the gene and thus affect its expression. However, little is known about the genome-wide influence of spatial proximity on these distal eQTLs. Recent 'HiC' data provide interaction counts between fragments of DNA throughout the genome, for genomes of various cell lines. To correlate chromatin structure with eQTLs on a genome-wide scale, we developed stringent statistical tests and demonstrate that, on the whole, high-frequency HiC interactions are enriched in eQTLs, and we confirm that a number of known distal eQTLs are spatially proximate. These eQTLs overlap with spatially proximate eQTLs from targeted studies, and also serve as predictions for novel spatially proximate eQTLs. Our results suggest that the overall structure of chromatin plays an important role in explaining how long-distance mutations modulate gene expression.


B21:
Exact Fermi-Dirac Statistics for Bacterial Transcription Factors

Subject: Biological Networks

Presenting Author: Patrick O'Neill, University of Maryland Baltimore County

Author(s):
Ivan Erill, University of Maryland Baltimore County, United States

Abstract:
Efficient transcriptional regulation in bacteria requires a balance between precision and tolerance: transcription factors must rapidly locate and bind cognate sites within a narrow band of binding affinities despite fluctuations in copy number and other cellular noise. Hence, a full understanding of transcriptional regulatory networks will require accurate binding models for the prediction of binding probabilities over a complete chromosomal sequence. Much early work on the design principles underlying transcription factor binding dynamics has been limited by simplifying assumptions concerning the genomic environment. In particular, it is common for models to neglect sequence data, and to treat the transcription factor either as a single copy or a multiple-copy system approximated by an ideal gas.

In this work we present a method for the exact solution of the Fermi-Dirac statistics for an arbitrary TF copy number bound to real genomic sequence. We compare our results to the ideal gas approximation and show considerable disagreement in the range of physiologically relevant TF concentration for many bacterial regulatory systems. We find that the idealized model consistently underestimates binding probabilities, and that this effect is most apparent in large regulons. The exact solution presents a method for inferring regulon size from a collection of known sites and the genomic sequence: through comparison of real genomic sequence to ensembles of random controls, we can compute the concentration at which real chromosomes effectively saturate. We also validate the binding predictions of our model on several ChIP-Seq datasets and compare its performance to existing alternatives.
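
One common way to formalize exact occupancy at fixed copy number is sketched below in Python. It is offered as an illustration under stated assumptions, not as the authors' method: every one of the n TF copies is assumed bound, each site holds at most one copy, and binding energies are given in units of kT (lower means stronger binding). Under these assumptions the partition function over all placements of n copies is the n-th elementary symmetric polynomial of the Boltzmann weights. The function names and the example energies are hypothetical.

```python
import math


def elementary_symmetric(weights, n):
    """Return [e_0, ..., e_n], the elementary symmetric polynomials."""
    e = [1.0] + [0.0] * n
    for w in weights:
        # Go downward so each weight is used at most once per product.
        for k in range(n, 0, -1):
            e[k] += w * e[k - 1]
    return e


def exact_occupancy(energies, n, beta=1.0):
    """Probability that each site is occupied when exactly n copies are bound.

    P(site i occupied) = w_i * e_{n-1}(weights without i) / e_n(all weights),
    with Boltzmann weights w_i = exp(-beta * energy_i).
    """
    if not 1 <= n <= len(energies):
        raise ValueError("n must be between 1 and the number of sites")
    weights = [math.exp(-beta * eps) for eps in energies]
    z_n = elementary_symmetric(weights, n)[n]
    probs = []
    for i, w in enumerate(weights):
        rest = weights[:i] + weights[i + 1:]
        probs.append(w * elementary_symmetric(rest, n - 1)[n - 1] / z_n)
    return probs


# Hypothetical energies (in kT) for five sites and two TF copies.
occupancy = exact_occupancy([-2.0, -1.0, 0.0, 0.0, 1.0], n=2)
```

The occupancies sum to n by construction, which is a convenient sanity check. This direct version recomputes the polynomial for every site and is therefore quadratic in the number of sites; a genome-scale calculation would need a more economical scheme.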


B22:
Systematic Investigation of the Evolution of Protein Complexes of the Bacterial Cell Envelope

Subject: Biological Networks

Presenting Author: Cedoljub Bundalovic-Torma, University of Toronto

Author(s):
John Parkinson, Hospital for Sick Children, Canada

Abstract:
The cell envelope of gram-negative bacteria serves as an important interface between an organism and its environment. Numerous gram-negative bacteria are important human pathogens, and an increasing number are becoming resistant to traditional antibiotic treatments. Thus the study of the cadre of proteins associated with the cell envelope may provide novel drug targets for future therapeutics. The advent of genomic sequencing, coupled with improvements in experimental methodology, presents us with a novel opportunity to elucidate the evolution of the cell envelope, particularly of integral protein complexes, and furthermore to elucidate in greater detail how bacteria survive and adapt to diverse environmental niches.

It has become appreciated that bacterial genomes undergo a substantial degree of evolutionary flux, which can yield insight into how bacterial species adapt to particular environments and lifestyles. The underlying processes range from sequence divergence of conserved genes, to their duplication and subsequent functional divergence, horizontal transfer, and genome reduction. Comparative genomics approaches have revealed that proteins involved in the interaction of bacteria with their environment, i.e. those of the cell envelope, are particularly influenced by these processes. However, no large-scale systematic studies have thus far been performed to address how these processes affect the evolution of cell envelope complexes across gram-negative bacteria of diverse phylogeny and lifestyle. We show that by employing a computational approach we can gain novel insight into the evolution of the cell envelope and increase our understanding of bacterial pathogenesis.


B23:
Sequence Analysis and Molecular Pathway Prediction of Sirtuin in Bacillus megaterium

Subject: Biological Networks

Presenting Author: Baraka Williams, Jackson State University

Author(s):
Victoria Gilmore, Center for Bioinformatics/ Jackson State University, United States
Shaneka Simmons, Center for Bioinformatics/ Jackson State University, United States
Andreas Mbah, Center for Bioinformatics/ Jackson State University, United States
Bianca Garner, Division of Natural Sciences/Tougaloo College, United States
Hugh Nicholas, Pittsburgh Supercomputing Center, United States
Raphael Isokpehi, Center for Bioinformatics and Computational Biology/ Jackson State University, United States

Abstract:
Bacillus megaterium is gram-positive and the largest bacterium found in soil. B. megaterium has been characterized as a biotechnologically relevant organism due to its use in medicinal, scientific and industrial advancement. One important known feature of B. megaterium is its ability to survive extreme environmental conditions by forming endospores. Genes encoding the universal stress protein (USP) domain enable cellular stress survival and are often found in conjunction with associated protein family domains. In B. megaterium, a two-gene transcriptional unit consists of genes for a universal stress protein and a sirtuin (Sir2), an NAD+-dependent deacetylase; sirtuins are important for regulating rDNA, silencing genes and stabilizing chromosomes. Members of the sirtuin family prolong lifespan and are considered longevity genes in yeast, nematode, mouse and fruit fly. The purpose of this study was to determine the phylogenetic relationships of bacterial sirtuins. Sequences were retrieved from the Protein Information Resource (PIR) and processed through the CD-HIT program in Galaxy, a bioinformatics workflow system. The multiple sequence alignment generated by T-Coffee was visualized in GeneDoc. The phylogenetic tree was constructed using FigTree and Notung. Multiple sequence alignment led to the identification of unique sequence features. The predicted two-gene transcriptional unit could be key for the survival of the industrially important Bacillus megaterium during nutritional stress.


B24:
A systematic evaluation of unaligned ChIP-Seq reads

Subject: Biological Networks

Presenting Author: Zachary Ouma, Ohio State University

Author(s):
Erich Grotewold, Ohio State University, United States
Maria Mejia-Guerra, Ohio State University, United States

Abstract:
Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) is an indispensable tool for understanding the dynamics and evolution of the regulatory circuitry of prokaryotes and eukaryotes. ChIP-Seq studies aim to decipher gene regulatory mechanisms by mapping genome-wide transcription factor binding sites (TFBSs). Aligning millions of short sequence reads to the reference genome is one of the fundamental steps in the ChIP-Seq data analysis pipeline. This is carried out in order to identify significantly enriched regions, which would correspond to putative TFBSs. The source of unaligned reads in ChIP-Seq studies has not been systematically explored. While it is not uncommon to observe only a fraction of the total number of reads aligning to the reference genome, very low alignment proportions affect the identification of TFBSs, thereby undermining the process of deciphering true gene regulatory networks (GRNs). We describe a computational approach to establish the source of unaligned ChIP-Seq reads from several major model organisms. The analysis of raw sequence reads revealed a significant level of contamination in unaligned ChIP-Seq reads with sequences of bacterial and metazoan origin, irrespective of the source of chromatin used for the ChIP-Seq studies. Unexpected, however, was the observation that selected unaligned-read data sets contained significant numbers of legitimate reads that have mappable properties but were missed in the alignment process. More interestingly, new TF enrichment sites were identified when the set of recovered reads was subjected to a typical ChIP-Seq peak-calling procedure. This highlights a need to improve the currently utilized alignment algorithms.


B25:
In silico high-throughput screening of versatile P-glycoprotein inhibitors using polynomial empirical scoring functions

Subject: Chemical Biology

Presenting Author: Sergey Shityakov, Würzburg University

Abstract:
P-glycoprotein (P-gp) is an ATP-binding cassette transporter that causes multidrug resistance to various chemotherapeutic substances by actively effluxing them from mammalian cells. P-gp plays a pivotal role in limiting drug absorption and distribution in different organs, including the intestines and brain. Thus, the prediction of P-gp-drug interactions is of vital importance for assessing the pharmacokinetic and pharmacodynamic properties of drugs. To aid in the discovery of the strongest P-gp blockers, we performed a high-throughput in silico screening of 1302 different P-gp inhibitors by a genetic algorithm using polynomial empirical scoring functions (polscore). We report a strong correlation (r² = 0.82, p < 0.0129, n = 6) of experimental IC50 (pIC50) values with polscore-predicted inhibition constants (Ki/pKi) using linear regression fitting. The hydrophobic interactions between P-gp and the drug substance are mainly responsible for the inhibition effect. The results show that this scoring technique might be useful in virtual screening and filtering of compound databases at the early stage of the drug development process.


B26:
Receptor Chemoprint-based 3D-QSAR Pharmacophore Approach for Development of Monocyclic A2AR Antagonists

Subject: Chemical Biology

Presenting Author: Amresh Prakash, University of Delhi

Author(s):
ChandraBhushan Mishra, University of Delhi, India
Namrata Kumari, University of Delhi, India
Nitin Kumar, University of Delhi, India
Pratibha Mehta Luthra, University of Delhi, India

Abstract:
Owing to a lack of selectivity and bioavailability, several A2AR antagonists for the therapy of Parkinson’s disease (PD) have failed in clinical trials. In the present work, we aimed to find selective and potent A2AR antagonists with significant bioavailability. Receptor-chemoprint-based 3D-QSAR pharmacophore models were used to design thioxo-thiazolo compounds as novel non-xanthine monocyclic A2AR antagonists. Two lipophilic hydrogen bond acceptors, one hydrophobic aliphatic/aromatic group and one ring aromatic feature are defined as the essential pharmacophoric features of novel hits for monocyclic A2AR antagonists. To illustrate receptor subtype selectivity, pharmacophore models were generated for all four adenosine receptor subtypes (A1R, A2AR, A2BR and A3R). A novel monocyclic compound, (E)-4-(4-bromobenzylideneamino)-3-phenyl-2-thioxo-2,3-dihydrothiazole-5-carbonitrile (ACPT), was designed and synthesized, and in vitro and in vivo studies were carried out to confirm the in silico results. The results of the radioligand-binding assay of ACPT with human A2AR expressed in HEK293T cells (Ki = 0.004 nM; A1R/A2AR selectivity: 46500) and the A2AR-coupled release of endogenous cAMP (0.6 pmol/ml) in NECA (A2AR agonist)-treated HEK293T cells in the functional assay are consistent with the predictions (ΔG = -11.26 kcal/mol and Ki = 0.06 nM; selectivity: 891/0.06 = 14850). Moreover, ACPT-pretreated, haloperidol-induced Swiss albino male mice showed a reduction in catalepsy (motor impairment). These results suggest that ACPT could be explored as a novel, potent and selective monocyclic A2AR antagonist in the treatment of PD. Notably, a combined receptor-chemoprint and ligand-based strategy has been used for the first time to generate pharmacophore models for the rational design of monocyclic A2AR antagonists.


B27:
Mobile Interaction and Query Optimization in a Protein-Ligand Data Analysis System

Subject: Databases & Ontologies

Presenting Author: Marvin Lapeine, Montclair State University

Author(s):
Katherine Herbert, Montclair State University, United States
Nina Goodey, Montclair State University, United States

Abstract:
In studying usage of such a tool, the benefits of a mobile platform have become evident. While many practitioners have integrated technology readily into their experimental environment, many others are working in environments that are not conducive to a computational platform. Moreover, our team has observed the usefulness of the model and is looking at possible extensions of this work into understanding invasive species studies and species survival studies in particular identified regions. When considering the environments that this platform has the potential for working within, mobility becomes a key concern. Considering the nature of this platform, where the user needs to see potentially massive data sets visualized with a histogram, both querying and visualization issues become concerning. When adding the invasive and survivability of species applications, these problems become compounded due to the data collection environment. Even the most computationally sophisticated platforms offer challenges in addressing this problem. This poster presents our observations and the solutions we have attempted.


B28:
Comparing a few SNP calling algorithms using low-coverage sequencing data

Subject: Clinical Informatics & Epidemiology

Presenting Author: Xiaoqing Yu, Case Western Reserve University

Author(s):
Shuying Sun, Case Western Reserve University, United States

Abstract:
Several single nucleotide polymorphism (SNP) calling programs have been developed to identify novel SNPs and mutations in next generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNP calling. Moreover, commonly used SNP callers usually include several metrics for each potential SNP in their output files. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validation. To compare the performance of SNP callers on a low-coverage sequencing dataset, we first compare the SNP calling results generated by four algorithms, SOAPsnp, Atlas-SNP2, samtools, and GATK, without any post-output filtering. We report three findings. First, SOAPsnp calls more SNPs than the other algorithms since it has few internal filtering criteria, whereas Atlas-SNP2 reports the fewest SNPs since it has stringent internal filtering criteria. Second, using several cutoff values for the sequencing coverage of called SNPs, we find that filtering the SNPs with a higher coverage threshold improves the agreement among the four algorithms. Third, we explore the values of a few key metrics in each algorithm and use them as post-output filtering criteria to maximize the agreement of SNP findings among algorithms. Our exploratory results show that high-coverage regions or bases tend to have high calling qualities. We recommend that users employ more than one SNP calling algorithm and use coverage and calling quality as filtering criteria for reliable SNP identification.
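
The coverage-threshold comparison described in this abstract can be sketched in Python as follows. The agreement measure used here, the fraction of called sites reported by every caller (intersection over union), is an assumption, since the abstract does not define how agreement was quantified. The caller labels, sites and coverage values are made up for illustration and are not output from any of the tools named above.

```python
def filter_by_coverage(calls, min_coverage):
    """Keep the sites whose sequencing coverage meets the threshold.

    `calls` maps a site, here a (chromosome, position) tuple, to coverage.
    """
    return {site for site, cov in calls.items() if cov >= min_coverage}


def agreement(call_sets):
    """Fraction of all called sites that every caller reports."""
    call_sets = list(call_sets)
    if not call_sets:
        return 0.0
    union = set().union(*call_sets)
    if not union:
        return 0.0
    shared = set.intersection(*call_sets)
    return len(shared) / len(union)


# Hypothetical calls from three callers: site -> coverage at that site.
calls_by_caller = {
    "caller_a": {("chr1", 100): 3, ("chr1", 200): 12, ("chr2", 50): 20},
    "caller_b": {("chr1", 200): 11, ("chr2", 50): 18},
    "caller_c": {("chr1", 100): 4, ("chr2", 50): 22, ("chr3", 7): 2},
}

for threshold in (0, 10):
    kept = [filter_by_coverage(c, threshold) for c in calls_by_caller.values()]
    print(threshold, agreement(kept))
```

In this toy data, raising the threshold from 0 to 10 removes the low-coverage sites on which the callers disagree, so the agreement increases, mirroring the trend the abstract reports.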


B29:
Computational analysis of BMP-receptor and BMP-antagonist interactions with a view to designing BMP super-agonists and dominant negatives

Subject: Disease Models & Molecular Medicine

Presenting Author: Satnam Surae, University College Dublin

Author(s):
Finian Martin, University College Dublin, Ireland
Jens Nielsen, University College Dublin, Ireland

Abstract:
Bone Morphogenetic Proteins (BMPs) are integral regulators of bone and organ development and are expressed, with their antagonists, in fibroses. BMPs are secreted proteins and signal by associating with membrane-bound receptors. BMP action is modulated by a family of secreted protein antagonists, including Noggin, Crossveinless-2 and Gremlin, that limit BMP-receptor association by binding to the ligand and thereby inhibiting receptor binding. We have analysed BMP-receptor and BMP-antagonist interactions using a newly designed automated pipeline, Protein Complex Tool (PCT). Co-crystal structures of the BMP-BMP receptor and the BMP-antagonist complexes were submitted to PCT, and in silico alanine substitution scans were performed to calculate the free energy contribution of each BMP-2 or BMP-7 residue to the stability of the complexes with receptors, BMPRIa (BMP-2 only) and ActRIIa, and antagonists, Crossveinless-2 (BMP-2 only) and Noggin (BMP-7 only). The free energy calculations identified the key contributions of BMP residues to both binding events and suggested mutations that might generate super-agonist and dominant negative molecules. Further in silico analysis was performed by mutating each residue to each of the other 19 amino acids. From this we identified potential super-agonist and dominant negative mutations for both BMP-2: L51V and N102T (super-agonists) and S88G and L92D (dominant negatives), and BMP-7: E60T, D119I, I124A and K127E (super-agonists) and F117E and V122D (dominant negatives). The super-agonists will bind and activate the receptor but will be resistant to binding by the antagonist; in contrast, the dominant negatives bind the antagonist but not the receptor. Both have potential as therapeutic leads for treating fibrotic diseases.


B30:
Rate of Operon Evolution in Proteobacteria

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: David Ream, Miami University

Author(s):
Iddo Friedberg, Miami University, United States

Abstract:
How operons evolve is an open question in genome evolution, and several models have been proposed to explain observed evolutionary patterns. However, the community lacks a method to describe operon evolution, which is necessary for a large-scale examination of the forces shaping this genomic feature. We propose a computational method to classify operons by their evolutionary trajectory. This method will enable us to examine how operons evolve on a case-by-case basis, and classify their evolutionary parameters.

We propose that the construction and/or destruction of clusters of co-transcribed genes can be described as a sequential series of defined events. By examining these events we can employ statistical learning to classify evolutionary paths of operons, and connect these paths with biological function. At the most basic level, clusters of genes can undergo a few basic operations. Clusters can be broken into smaller groups, and constituent genes can be duplicated, deleted, rearranged, or fused.

Using a set of 36 proteobacteria species and 46 different experimentally verified E. coli operons, each containing at least five genes, we have examined the frequency of events for the study of operon evolution. Event tracking has allowed us to compare the relative rate of evolution against evolutionary time. Some operons appear to have a consistently high or low frequency of events, making them fast or slow evolving, respectively. Slow evolvers comprise essential gene complexes, such as ATP synthase and components of the transcriptome. Fast-evolving operons tend to be non-essential, such as those for the utilization of alternate catabolites and certain transporters.


B31:
Informational constraints on transcription factor-binding site coevolution

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Robert Forder, University of Maryland Baltimore County

Author(s):
Patrick O'Neill, University of Maryland Baltimore County, United States
Ivan Erill, University of Maryland Baltimore County, United States

Abstract:
The transcriptional program of living organisms is implemented by transcription factors (TFs), which modulate gene activity by binding selectively to short DNA sequences (sites) within the promoter region of target genes. In spite of their fundamental importance, models of TF-DNA binding and of the co-evolution of TFs and their binding sites remain fairly underspecified due to the lack of consistent data on transcription factor binding across multiple species, and incorporate several simplifying assumptions that have not been consistently supported by empirical data. Here we put forward a reverse engineering approach based on a genetic algorithm and established biophysical models to analyze quantitatively the evolution of transcription factors and their cognate sites in realistic genomic environments. This allows us to test several hypotheses and assumptions about the evolution of these systems, and to gauge the relative importance of informational and biochemical constraints on the coevolution of TFs and their binding sites. We use the evolutionary simulator to assess the evolution of essential properties of TF-binding sites, such as the distribution of information, site length or motif symmetry, as a function of transcription factor copy number, biophysical models and regulatory constraints. Our results indicate that some frequently observed traits in TF-binding motifs, such as symmetry, are mostly the result of biophysical constraints. In contrast, features like the distribution of information appear to be mainly due to informational requirements. We compare these results with available experimental evidence in bacteria and we discuss their implications for the analysis of transcriptional networks and their evolution.


B32:
Influence of resampling on the estimates of synonymous and nonsynonymous divergence in the HA gene of influenza A.

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Mary Halpin, Kent State University

Author(s):
Helen Piontkivska, Kent State University, United States
Nina Pollard, Kent State University, United States

Abstract:
International travel and the confined spaces of airplanes and trains facilitate the rapid spread of respiratory diseases such as influenza A, which is transmitted through respiratory droplets and shared surfaces. Furthermore, large-scale industrial farming and the close quarters shared by animals and the humans who interact with them enable viral species jumps through antigenic shifts that later spread throughout the human population. The HA gene, which encodes a protein that interacts with cell surface receptors, plays a vital role in determining species specificity and its changes. Antigenic shifts are associated with major changes to specific HA amino acid residues that allow the virus to shift its affinity from animal to human receptors. Thus, it is critical to adequately sample the strains currently in circulation across a broad range of geographical locations, in both human populations and species that can serve as reservoirs, to recognize the early signature of potential pandemic strains. However, the existing databases have inherent sampling biases toward certain geographical areas and/or certain years (e.g., 2009). Because average levels of synonymous and nonsynonymous divergence are often used as tools to identify trends in influenza evolution, the presence of outliers may lead to biased estimates. In this study, we examine the impact of sequence resampling on the estimates of various divergence patterns and subsequent inferences using publicly available influenza A HA gene sequences.


B33:
Coding length matters in molecular evolution

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Seung Ho Shin, Department of Medical Biotechnology, College of Biomedical Science, and Institute of Bioscience & Biotechnology, Kangwon National University, Chuncheon 200-701, South Korea

Author(s):
Seung Gu Park, College of Biomedical Science, and Institute of Bioscience & Biotechnology, Kangwon National University, Korea, Rep

Abstract:
Coding sequences evolve at different rates. Several variables responsible for the differences in evolutionary rate among protein sequences have been studied using completed genomes and various forms of high-throughput datasets from prokaryotes to mammals. The most studied evolutionary-rate-correlated variables are essentiality, gene compactness, expression level, expression breadth, protein-protein interaction, pleiotropy, propensity for gene loss, and recombination. Some variables correlate positively with evolutionary rate, while others correlate negatively. Unfortunately, most studies searching for the primary determinants of evolutionary rate have produced contradictory observations, mainly due to noisy and incomplete data. A few theoretical explanations have been suggested for why a specific correlation between a variable and evolutionary rate should exist in a given dataset. Coding length has not been a popular variable in studies of the determinants of evolutionary rate. In the present study, we therefore directly examine the relationship between coding length and the rate of evolution. Using a fixed group analysis that we designed in a previous study, we try to exclude the effect of other variables, such as expression breadth or expression level, on the correlation between coding length and evolutionary rate. We also try to resolve the effects of two different length-related variables, exon length and exon number, on the rate of evolution.


B34:
Sex and Multicellularity: A Plan for Detection of Recombination in Choanoflagellates

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Ashley Wain, The University of Akron

Author(s):
Ethan Knapp, The University of Akron, United States

Abstract:
Sex has yet to be found in choanoflagellates, but recent studies have shown that the process likely occurs in Monosiga brevicollis. The presence of recombination hotspots will be assessed in M. brevicollis using RDP3, a non-parametric recombination detection program incorporating BOOTSCAN, MAXCHI, and 3SEQ. It analyzes alignments three sequences at a time for discordant branching patterns. Sequences obtained from GenBank, the Broad Institute, cooperating labs, and in-house sequencing of ATCC50154 and PRA-258 will be aligned with Sequencher. The probability of any discovered recombination location occurring by chance is approximated using the binomial distribution, and a Bonferroni correction is included for multiple comparisons.
Located hotspots will be included as focal regions in an experiment to detect recombination in mixed cultures of 2 M. brevicollis clones. Two variable sites (by ≥ 1 bp) at least 1 Mb apart will be sequenced and primers developed so that amplicons differ substantially in size. Peptide nucleic acids (PNAs) will overlap variable portions and will be complementary to PRA-258 (wt). Due to size differences (short out-amplifies long) and inclusion of PNAs (amplification clamped in the presence of the complementary PNA), strains will have unique banding patterns from PCR products. PNA-mediated absolute qPCR clamping will be used to ensure precision and identification of potentially minute differences between parent clones and possible recombinants. A long-amplicon band is expected from reactions containing both primer sets, both PNAs, and recombined DNA. This band cannot, theoretically, result from either parent clone.
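
As a rough illustration of the significance calculation mentioned in this abstract, the sketch below computes a binomial tail probability and applies a Bonferroni correction. The function names and the example numbers are hypothetical, and this is not a description of how the recombination detection program itself computes its statistics.

```python
from math import comb


def binomial_tail(n, k, p):
    """Probability of observing at least k successes in n trials,
    each succeeding independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical numbers: 20 informative sites, 18 of which support a
# discordant branching pattern, a 0.5 chance per site under the null,
# and 100 comparisons across the alignment.
raw = binomial_tail(20, 18, 0.5)
adjusted = bonferroni(raw, 100)
print(raw, adjusted)
```

The Bonferroni step simply multiplies the raw probability by the number of comparisons, which is conservative when many windows of the alignment are tested.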


B35:
Efficiencies of identifying interacting genomic regions using evolutionary rate co-variation method using different number of sequences

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Madara Hetti Arachchilage, Kent State University

Author(s):
Porsha Frazier, Kent State University, United States
Helen Piontkivska, Kent State University, United States

Abstract:
The rapid rate of mutation in HIV-1, which leads to escape from the immune system and from drugs, remains a major challenge in the development of an effective vaccine. We showed earlier that highly conserved regions from different HIV-1 genomic regions that frequently co-occur, and likely interact with each other, can serve as potential targets for multi-epitope vaccines (e.g., Paul and Piontkivska, 2010). The HIV-1 Reverse Transcriptase (RT) and Integrase (IN) enzymes catalyze the essential steps of reverse transcription and integration, respectively, during the viral life cycle. They also appear to interact with each other, although the molecular mechanisms remain to be elucidated. A new method was recently proposed to identify physically or functionally interacting protein pairs based on evolutionary rate co-variation, in which the individual branch rates of one protein are compared with the corresponding branch rates of another, potentially interacting protein (Nathan et al, 2012). However, it is unclear how well this method will work when the extent of sequence divergence is relatively small and/or the number of sequences used is large. In this simulation study, we examine the efficiency of identifying interacting pairs using datasets of various sizes obtained by resampling available HIV-1 Pol gene sequence data. Our results show that correlation coefficients tend to be higher when the rates of nonsynonymous substitutions are used than when branch length estimates that also take synonymous changes into account are used. Overall, examining the efficiency of this approach under different simulated datasets will be important for future applications of the method to the identification of interacting regions.
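
The core comparison in evolutionary rate co-variation can be sketched as a correlation between two lists of branch rates, one per protein, taken over the same branches of a shared tree. The example below is a minimal sketch under that assumption; the function name and the rate values are invented for illustration and are not taken from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-branch substitution rates for two proteins,
# listed in the same branch order.
rt_rates = [0.010, 0.020, 0.015, 0.030, 0.025]
in_rates = [0.012, 0.018, 0.016, 0.028, 0.027]

erc = pearson(rt_rates, in_rates)
print(round(erc, 3))
```

A high coefficient means the two proteins speed up and slow down on the same branches, which is the signal the method interprets as possible interaction.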


B36:
Bioinformatic Characterization of Orf6 – A Thioesterase; Prediction and Experimental Verification

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Maria M. Rodriguez-Guilbe, University of Puerto Rico School of Medicine

Author(s):
Ricardo González-Méndez, University of Puerto Rico School of Medicine, United States
Troy Wymore, Oak Ridge National Laboratory, United States
Alexander Ropelewski, Pittsburgh Supercomputing Center, United States
Eric Schreiter, Howard Hughes Medical Institute, United States
Abel Baerga-Ortiz, University of Puerto Rico School of Medicine, United States

Abstract:
Polyunsaturated fatty acids (PUFAs) are made in nature by a polyketide synthase (PKS) system. When the polyketide chain has reached the required length, a thioesterase (TE) domain is responsible for release of the reaction products. To date, no TE domain involved in PUFA biosynthesis has been identified for marine organisms. A good candidate is the orf6 gene from P. profundum, which is located directly upstream of the PUFA gene cluster. Bioinformatic sequence analyses were performed to determine whether thioesterase characteristic residues and sequence motifs were present in Orf6. BLAST searches, multiple sequence alignments with PSI-Coffee, motif discovery with MEME (Multiple EM for Motif Elicitation), and InterProScan results showed that the Orf6 protein belongs to the 4-hydroxybenzoyl-CoA (4HBT) TE family, which is characterized by having a hot-dog fold with a conserved Asp17 in the active site. This is consistent with the Orf6 crystal structure we determined by X-ray crystallography. Motif analysis showed two large well-conserved motifs, including the 4HBT motif. Evolutionary analysis using PHYLIP resulted in a phylogenetic tree with three clusters, suggesting the evolution of these genes into sub-families related to the surrounding environments. Docking simulations using AutoDock Vina found low-energy and biochemically relevant conformations showing that Orf6 prefers long-chain fatty acids over short-chain fatty acids. Isolation and biochemical characterization using a thioesterase activity assay provided verification of thioesterase activity.


B37:
Chromatin regulated targeting of p63 transcription factor

Subject: other

Presenting Author: Isha Sethi, University at Buffalo (The State University of New York)

Author(s):
Michael Buck, University at Buffalo (The State University of New York), United States
Satrajit Sinha, University at Buffalo (The State University of New York), United States

Abstract:
The human transcription factor p63 is a key player in the development and differentiation of the epidermis. Indeed, p63 knockout mice show a severe phenotype of no eyelids, no hair, limb truncations and profound alterations of skin. Recent ChIP-Seq studies in human primary keratinocytes have identified over 10,000 p63 binding sites in the genome. How and why p63 targets these specific sites from over 1 million potential p63 binding sites is unclear. In this study we use p63 ChIP-Seq datasets from Kouwenhoven et al (2010) and chromatin modification datasets from ENCODE to address these questions. We find that the chromatin architecture of p63-bound enhancers differs from that of p63-bound proximal promoters. The H3K4me mark is almost absent at the promoters but present at the enhancers. In addition, H3K4me3 and H3K9ac are more strongly enriched at promoters than at enhancers. The chromatin architecture at p63 targets can predict p63 binding by discriminant analysis, suggesting that functional p63 binding sites are characterized by a chromatin signature of H3K4me, H3K4me2, H3K4me3, H3K9ac and H3K27ac marks. In addition, the majority of the 10,000 in vivo p63 binding sites lack a p63 binding sequence or have a severely degenerate one. These sites are enriched for other transcription factors, which likely represent new uncharacterized co-factors. Furthermore, we integrated RNA-Seq experiments and found that only 10% of p63 target genes seem to require p63 for their expression.


B38:
BLASTing with Chromatin Architecture: A Novel Method of Genomic Functional Element Identification and Annotation

Subject: Personalized Genetics & Genomics

Presenting Author: William Lai, SUNY Buffalo

Author(s):
Michael Buck, SUNY Buffalo, United States

Abstract:
The sequencing of the human genome was the first step in understanding the enormous complexity of the genomic regulatory complex. Approximately 3 billion base pairs encode the raw instructions for synthesizing every human cell, tissue, and organ. The key to understanding how this list of instructions can be interpreted so differently between cell types has only recently begun to be deciphered. The identification of genomic functional elements, i.e., promoters and enhancers, has been an integral part of teasing apart the complex regulatory networks. However, finding these locations within the genome can be a laborious and expensive undertaking requiring site-specific assays. Even more difficult is identifying entirely new classes of genomic features. To facilitate identification and characterization of new classes of genomic features, we have developed and implemented the chromatin Architecture Basic Local Alignment Search Tool (ArchBLAST). The ArchBLAST algorithm utilizes conserved chromatin architecture at known sites of interest and globally searches the genome for similar sites. ArchBLAST differs from other approaches in that it uses the amplitude and spatial arrangement of chromatin modifications to score similarity. Importantly, ArchBLAST allows for identification of subtypes of known genomic features and can accurately predict previously uncharacterized locations. We have validated the accuracy of our approach with multiple well-characterized genomic features from yeast and humans. We show that ArchBLAST is capable of predicting both gene expression and genomic feature directionality, as well as identifying cell-type-specific enhancers, using only chromatin architecture.


B39:
GMOL: A Tool for 3D Genome Structure Visualization

Subject: Personalized Genetics & Genomics

Presenting Author: Chenfeng He, University of Missouri Columbia

Author(s):
Avery Wells, University of Missouri Columbia, United States

Abstract:
We developed a tool, GMOL, to visualize genome tertiary structure. In order to effectively visualize large-scale structures, such as the human genome, which consists of nearly three billion DNA base pairs, we adopted a multi-scale visualization strategy. Specifically, GMOL uses six separate scales (using the human genome as an example): genome scale, chromosome scale, loci scale, fiber scale, nucleosome scale and nucleotide scale. A new file format was designed to store the data of the six different scales. GMOL allows a user to browse the 3D structure and sequence at different scales. With GMOL, a user can choose a unit or point at any scale and scale it up or down to visualize its structure and retrieve corresponding genome sequences from either Ensembl or a local database. GMOL also allows a user to interactively manipulate and measure the whole genome structure and to extract static images and machine-readable data files in the PDB format from the multi-scale structure.


B40:
The Transition to Clinical NGS: How Well Do You Know Your Sequencing Pipeline?

Subject: Personalized Genetics & Genomics

Presenting Author: Donald Corsmeier, The Research Institute at Nationwide Children's Hospital

Author(s):
Ben Kelly, The Research Institute at Nationwide Children's Hospital, United States
Peter White, The Research Institute at Nationwide Children's Hospital, United States

Abstract:
As next generation sequencing technologies are rapidly adopted in clinical genetics settings, higher expectations must be placed on the bioinformatics pipelines used to transform hundreds of gigabytes of raw read data into a few lines of meaningful genetic variation that is useful to the clinician. Reducing these data by several orders of magnitude is a complex process rife with the capacity for error and inadvertent deviation from a best practices approach. Notwithstanding these potential pitfalls, even if secondary analysis is performed correctly using accepted methods, parallelization and down-sampling techniques can introduce nondeterminism in the resulting variant call set. This brings into question the utility of a given computational approach in a clinical setting.
Committed to the repeatability of test results based on discrete digitized genomic data, we investigated three pipelines that use one of the most popular software combinations for secondary analysis: Burrows-Wheeler Aligner (BWA) for short read sequence alignment together with the Genome Analysis ToolKit (GATK) for variant calling and genotyping. Somewhat surprisingly, we discovered that nondeterminism can be introduced at virtually every step in the analysis if configuration parameters are not carefully selected. Of the analysis approaches tested, only Churchill preserved determinism while adhering to best practices, regardless of the level of parallelization. Further, we demonstrate that Churchill’s speed and precision do not come at the expense of quality of the output variant call set. Of the three full pipelines evaluated, only Churchill achieves identity to a single-threaded serial implementation.
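
One simple way to test whether two runs of a pipeline produced identical variant calls is to compare order-independent digests of the call sets. The sketch below assumes each call can be reduced to a (chromosome, position, reference allele, alternate allele, genotype) tuple; that record layout, the function name, and the toy data are assumptions for illustration and do not describe how any of the evaluated pipelines was actually checked.

```python
import hashlib


def call_set_digest(variants):
    """Order-independent SHA-256 digest of a collection of variant calls.

    Each call is a (chrom, pos, ref, alt, genotype) tuple. Calls are
    de-duplicated and sorted so that output order does not affect the digest.
    """
    digest = hashlib.sha256()
    for record in sorted(set(variants)):
        digest.update("\t".join(map(str, record)).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical call sets from three runs of the same analysis.
run_a = [("chr1", 12345, "A", "G", "0/1"), ("chr2", 500, "C", "T", "1/1")]
run_b = [("chr2", 500, "C", "T", "1/1"), ("chr1", 12345, "A", "G", "0/1")]
run_c = [("chr1", 12345, "A", "G", "1/1"), ("chr2", 500, "C", "T", "1/1")]

same_ab = call_set_digest(run_a) == call_set_digest(run_b)  # reordered only
same_ac = call_set_digest(run_a) == call_set_digest(run_c)  # genotype differs
print(same_ab, same_ac)
```

Comparing digests rather than raw files ignores harmless differences in record order while still flagging any change in a called genotype.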


B41:
A Computational framework for assessing mRNA-miRNA interactions in Ischemia

Subject: Databases & Ontologies

Presenting Author: Maha Soliman, University of Louisville

Author(s):
Kalina Andreeva, University of Louisville, United States
Nigel Cooper, University of Louisville, United States

Abstract:
Ischemia-Reperfusion (IR) injury has been implicated in numerous retinal disorders, the etiological expression of which can be associated with various protein-coding and non-coding genes. While much progress has been made towards identification of genes and pathways associated with retinal disorders, the regulatory mechanisms for their coordinated expression are just beginning to be uncovered. Our laboratory has generated mRNA and miRNA microarray data using a rat model of IR injury in the retina for multiple post-ischemic time points. Our goal is to develop a pipeline for linking miRNA to gene expression and cellular processes. A correlation analysis was carried out on miRNAs and their target genes in our datasets based on expression values over the post-ischemic time points. The correlation analysis approach provides a strategy that may be compared to algorithms that rely on sequence-based targeting databases.


B42:
A Survey of Protein Storage and Search Cost Reduction Techniques

Subject: RNA & Protein

Presenting Author: Jeff Chapman, University of Akron

Abstract:
To determine the biological functions of a newly sequenced protein, one often compares it against a database of existing proteins to find sequences that share substructures or motifs. Because the function of a protein is highly correlated with its structure, finding similar sequences that have known biological function is a quick way to understand a new protein's likely functions.

Two primary concerns for comparing protein sequences are computation time and storage requirements. The most accurate algorithms to find similar protein strands use a dynamic programming approach, but this is computationally expensive. Modern methods attempt to reduce the time complexity by using heuristics or data transformations to decrease the amount of data examined. These approaches provide only approximate matches, but this is acceptable because often a group of similar protein sequences should be examined to determine the new sequence's properties.

In this research, we examine a group of exact and approximate algorithms for selecting, from a database of proteins with known biological functions, sequences similar to the one in question. The factors examined are search speed relative to protein size and database size, along with how the match sets overlap. Modern compression algorithms are examined for their efficiency in reducing storage requirements and for their (de)compression time. Algorithms and techniques examined include Smith-Waterman, BLAST, and prediction by partial matching. An application of a music thumbnailing algorithm to protein searching will be presented. The pros and cons of these algorithms will be presented.
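
To make the cost of the dynamic programming approach concrete, the sketch below computes a Smith-Waterman local alignment score. It uses a simple match/mismatch score and a linear gap penalty chosen for illustration; practical protein searches typically use a substitution matrix and affine gap penalties, so this is a simplified instance of the algorithm rather than a production implementation.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Runs in O(len(a) * len(b)) time and keeps only two rows of the
    dynamic programming table in memory.
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman("XXACDEXX", "YYACDEYY"))
```

Every cell of the table is filled for every query-database pair, which is why heuristic methods such as BLAST restrict the search to regions around short seed matches.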


B43:
Enhanced Protein Domain Identification by Using Conditional Context Scores

Subject: RNA & Protein

Presenting Author: Ling Zhang, University of Nebraska-Lincoln

Author(s):
Etsuko Moriyama, University of Nebraska-Lincoln, United States
Shunpu Zhang, University of Nebraska-Lincoln, United States

Abstract:
Protein domains are the structural, functional, and evolutionary units of proteins. Identification of domains is important in the annotation of protein structures and functions. Several methods have been developed to identify domains based on structural classes (SCOP) or conserved sequences (Pfam). However, many domains can still be missed in highly divergent proteins. As Ochoa et al. (2011) have done with their Domain Prediction Using Context (dPUC) method, domain co-occurrence information can be exploited to improve domain detection sensitivity. In dPUC, context scores are calculated by counting the numbers of domain pairs in the Pfam dataset, where each domain in a domain pair has the same ability to boost detection of the other domain of the pair. However, some domains occur repeatedly within a protein, and other domains can be found in different combinations in many different proteins. These high-frequency domains cause high pairwise domain scores regardless of the domains paired. This introduces incorrect boosting in the detection of low-frequency domains that are paired with high-frequency domains, which results in increased false discovery rates (FDRs) in domain detection. To overcome this problem, we propose a domain identification method that incorporates conditional context scores. We applied our approach to the Plasmodium falciparum protein set and showed improved performance compared to the original dPUC method. At a fixed FDR, it allows us to identify more Pfam domains correctly.


B44:
Visualizations of a Highly Conserved Eukaryotic Acyltransferase Enzyme and its Paralogues Using Molecular Dynamics Modeling, Conservation Mapping, and Database Annotations

Subject: RNA & Protein

Presenting Author: Cameron Schmidt, The University of Akron

Abstract:
Molecular dynamics modeling, conservation mapping, and selected database annotations were used to generate representative models of a highly conserved eukaryotic transmembrane acyltransferase enzyme and its paralogues using murine peptide sequences. Diacylglycerol acyltransferase 2 (DGAT2) catalyzes the only committed and rate-limiting step in the synthesis of triacylglycerol from diacylglycerol. DGAT2 is essential to mammalian biological function. Its inhibition has been implicated in the reversal of symptoms related to diet-induced liver disease and diabetes. Four immediate paralogues of DGAT2 were modeled for comparison. Target sequences were selected using database annotations and existing experimental evidence. Only topological domains were modeled. Homology modeling was used where model fragments were >200 amino acids in length, in order to generate a template for molecular dynamics simulations. Ab initio modeling was used where the fragment length was <200 amino acids. All models generated were simulated in aqueous solution at physiologic pH using YASARA Structure to a minimum time scale of ten nanoseconds. The most energetically favorable models were chosen from each fragment pool. Conservation determined from comparative multiple sequence alignments was mapped onto the model surfaces by color. Interpretation of these models may help drive the formation of hypotheses that impact the elucidation of structural and functional mechanisms of these and other membrane-bound acyltransferases.


B45:
Novel and Known miRNA Expression Profiling Changes Detected in siRNA-induced Anti-RSV Oryza Sativa

Subject: RNA & Protein

Presenting Author: Cheng Guo, Miami University

Author(s):
Li Li, Chinese Academy of Agriculture Science, China
Chun Liang, Miami University, United States

Abstract:
miRNAs and siRNAs are the two major classes of small RNAs that mediate RNA silencing to regulate cellular gene expression. Over the past decade, RNA-mediated gene silencing has been extensively shown to serve as a host-cell defense mechanism against viral and bacterial pathogens. Altered expression profiles and the emergence of novel miRNA/siRNA species induced by virus infection are thought to underlie this defense against foreign viruses. To gain deeper insight into this innate defense mechanism and into the interplay between siRNAs and miRNAs in the host cell, an RSV (Rice stripe virus)-resistant transgenic plant (T4B1) was constructed and Illumina RNA-Seq sequencing was conducted on both mutant (T4B1) and control (AiA) rice plants. Our analyses show that 1) increased siRNAs are produced in rice from the inserted hairpin structure derived from the nucleocapsid protein gene of RSV; 2) miRNA expression profiles change substantially upon RSV infection in both mutant and control groups, indicating potential biological functions for several miRNAs; and 3) siRNAs and miRNAs are strongly associated during this defense process.


B46:
Mass spectrometry comparison of recombinant and wild-type cetacean leptin

Subject: RNA & Protein

Presenting Author: Hope Ball, The University of Akron

Abstract:
Mass spectrometry is a vital tool in the identification of proteins and protein-protein interactions and in determining the existence and type of post-translational modifications. The application of liquid chromatography (LC) and tandem mass spectrometry (MS/MS), together LC-MS/MS, allows for determination and analysis of target peptides from complex mixtures of proteins and has been important in large-scale analyses of yeast proteomes (2) and analyses of protein expression in sectioned mammalian tissues (3). To date, this method has not been applied to studies of leptin protein expression. Leptin is a 16 kDa peptide hormone encoded by the obese (ob) gene and synthesized by adipose (fat) cells (1). It is best known for its role in the regulation of energy stores and food intake, where the presence of the protein acts to decrease appetite and increase metabolic rate. Arctic-adapted cetaceans build large adipose stores (blubber), and maintenance of these stores poses questions about the physiological effects of leptin in these animals. Here, LC-MS/MS was used to characterize signal peptides from full-sequence recombinant cetacean leptin. These processes were then applied in attempts to detect and characterize wild-type leptin proteins from sera samples of wild cetaceans. Comparisons of these spectra permit the detection of leptin protein from a unique mammalian organism and comparisons of cetacean leptin spectra to known mammalian models.


B47:
MULTICOM-RNA – a Bioinformatics Pipeline for RNA-Seq Transcription Data Analysis

Subject: RNA & Protein

Presenting Author: Jilong Li, University of Missouri Columbia

Abstract:
We have developed MULTICOM-RNA - a bioinformatics pipeline - for RNA-Seq transcription data analysis. It includes five steps: mapping RNA-Seq reads to a reference genome, normalizing read counts into gene expression values, identifying differentially expressed genes, predicting gene functions, and constructing biological pathways. In the first three steps, we integrated several public tools, such as TopHat, Cufflinks, HTSeq, edgeR, and DESeq, with MULTICOM-RNA to preprocess data and identify differentially expressed genes. The integration of our method and these tools can generate consensus results of good quality. In the last two steps, we use our in-house tools MULTICOM-PDCN and MULTICOM-GNET to predict the functions and gene regulatory networks of differentially expressed genes, respectively. We applied MULTICOM-RNA to RNA-Seq data generated from different species, such as human, mouse, TAIR10 Arabidopsis, and Drosophila melanogaster. The results on these datasets show that MULTICOM-RNA can effectively identify differentially expressed genes, cluster differentially expressed genes into cohesive function groups, and construct novel hypothetical gene regulatory networks.


B48:
Identifying Regulatory Mechanism of eQTLs Using the ENCODE Dataset

Subject: RNA & Protein

Presenting Author: Shweta Ramdas, University of Michigan

Author(s):
Viktoriya Strumba, University of Michigan, United States
Benjamin Keller, Eastern Michigan University, United States
Ellen Schmidt, University of Michigan, United States
Matthew Flickinger, University of Michigan, United States
Elsbieta Sliwerska, University of Michigan, United States
Alan Schatzberg, Stanford University, United States
Edward Jones, University of California, Davis, United States
William Bunney, University of California, Irvine, United States
Richard Myers, Hudson Alpha Institute, United States
Stanley Watson, University of Michigan, United States
Jun Li, University of Michigan, United States
Huda Akil, University of Michigan, United States
Margit Burmeister, University of Michigan, United States

Abstract:
Purpose: Expression QTLs (eQTLs) are loci that are associated with variation in gene expression. A majority of eQTLs are found in noncoding regions, and the mechanisms by which they influence expression are not easily determined. By overlapping eQTLs with transcription factor (TF) binding site (TFBS) information, we tried to ascertain the mechanism of regulation underlying these eQTLs. We asked what percentage of brain eQTLs fall within TFBSs. For a subset of eQTLs that showed the opposite direction of effect in brain compared to LCLs, we asked whether TF binding could explain these ‘effect reversals’. Method: We analyzed 100 human brain tissue samples with both expression and genotype data to identify 29,294 eQTLs from 10 brain regions. We used data for 124 TFs characterized by the ENCODE project and intersected their TFBSs with our brain eQTLs. Using a published dataset (Stranger et al., 2007), we identified brain eQTLs that were also associated with gene expression in LCLs, and selected eQTLs showing effect reversal in brain compared to LCLs. Results: We found that 6,991 brain eQTLs (24%) lie in a binding site for at least one TF. We also found that a number of eQTLs lie in binding sites for multiple TFs, with 1,023 eQTLs (3.5%) lying in sites that can bind 5 or more TFs. Of the eQTLs with the opposite direction of effect in brain compared to LCLs, 54% lie in binding sites of TFs that are known to have both activating and repressive effects, pointing to a possible mechanism for this effect reversal.
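
The intersection step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the tuple layouts, the closed-interval convention, and the toy coordinates are all assumptions made for illustration. It reports the fraction of eQTL positions covered by at least one TF binding site and the fraction covered by a chosen minimum number of distinct TFs.

```python
def tfs_covering(snp, tfbs):
    """Return the set of TF names whose binding site covers the SNP.

    snp:  (chrom, pos)
    tfbs: iterable of (chrom, start, end, tf_name); intervals are treated
          as closed [start, end] (an assumption for this sketch).
    """
    chrom, pos = snp
    return {tf for c, start, end, tf in tfbs if c == chrom and start <= pos <= end}


def summarize_overlap(snps, tfbs, multi=5):
    """Fraction of SNPs in >= 1 TF site, and fraction in sites of >= `multi` TFs."""
    n_any = 0
    n_multi = 0
    for snp in snps:
        k = len(tfs_covering(snp, tfbs))
        if k >= 1:
            n_any += 1
        if k >= multi:
            n_multi += 1
    total = len(snps)
    return n_any / total, n_multi / total


# Toy data (hypothetical coordinates and TF names).
toy_tfbs = [
    ("chr1", 100, 200, "TF_A"),
    ("chr1", 150, 250, "TF_B"),
    ("chr2", 10, 50, "TF_A"),
]
toy_snps = [("chr1", 160), ("chr1", 300), ("chr2", 10), ("chr2", 60)]

frac_any, frac_multi = summarize_overlap(toy_snps, toy_tfbs, multi=2)
print(frac_any, frac_multi)
```

The linear scan is adequate for a toy example; at the scale reported here (29,294 eQTLs against 124 TFs) an interval index would be the more practical choice.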


B49:
Alternative splicing of protein coding and non coding genes of Chlamydomonas reinhardtii

Subject: RNA & Protein

Presenting Author: Praveen Raj Kumar, Miami University

Abstract:
Pre-mRNA splicing is one of the fundamental post-transcriptional processes in eukaryotic gene expression and regulation. Alternative splicing (AS) occurs when different splice sites of a pre-mRNA are processed to generate distinct transcript isoforms from the same gene, leading to diverse proteins with functional and/or structural differences. Focusing on Chlamydomonas reinhardtii, we evaluated AS events and sought to understand the sequence features responsible for alternatively spliced introns. Based on all available Sanger-based ESTs (338,243) and 454 cDNAs (7,007,189), we evaluated AS events using GMAP and PASA, and updated the AUGUSTUS gene annotation (AU10.2) by consolidating it with the resulting PASA AS models. Among 16,237 AUGUSTUS multi-exon protein-coding genes, 52.2% were subject to AS, while 7.85% of PASA-deduced multi-exon non-coding genes (1,969) show evidence of AS. As observed in other plants analyzed so far, we found that the dominant AS mode is intron retention in both protein-coding (44%) and non-coding genes (42.75%). In an effort to substantiate the difference between constitutive and retained introns, we observed that of 127 miRNA sources located in introns, 124 were found in constitutive introns and only 3 in retained introns. We also observed that retained introns have weaker splice sites and branch points, and fewer splicing enhancers. In addition, we found that the ability to form stem-loop structures is more pronounced in retained introns. This suggests that short introns that are spliced by intron definition are more likely to be retained if they possess weak signals, are able to form secondary structure, and are not a source of miRNA.


B50:
Comparison of Thyroxine Binding Globulin to Other Clade A and B Serpins

Subject: RNA & Protein

Presenting Author: Moriah Holt, Franciscan University of Steubenville

Abstract:
The superfamily of serine protease inhibitors (serpins) includes many clades designated by letter (e.g., A, B, etc.). Among the clade A serpins is Thyroxine Binding Globulin (TBG), which is known to help synthesize and transport thyroxine and 3,5,3’-triiodothyronine from the thyroid to where it is needed in the bloodstream. Among the serpins, particularly in clade A, TBG’s role is unusual, considering that most of the closely related serpins are primarily protease inhibitors, whereas TBG operates as a transporter. This undergraduate research project sought to compare the structural, functional, and phylogenetic similarities of human TBG with 154 selected members of clade A and B serpins from various organisms. Three invariant residues (Ser49, Phe205, Pro286) were found in the amino acid sequence alignment and contribute to the conserved tertiary structure of both clade A and B serpins. Pattern analysis revealed that most of the ten conserved sequence motifs were found throughout all clade A and B serpins, except for motifs 2, 4, 7 and 10, which are not present in serpin A8. Motifs 9 and 10 contribute to the thyroxine-binding site of TBG. Group entropy analysis identified that residues uniquely conserved in TBGs contribute to thyroxine binding. Phylogenetic analysis demonstrated a distinct separation between the clade A and B serpins, as well as distinct clades for each separate group of serpins from both clade A and B, except for B3 and B4 serpins, which are highly similar.


B51:
Comparison of Sirtuin-2 to Other Histone Deacetylases

Subject: RNA & Protein

Presenting Author: Michael Niemaszyk, Franciscan University of Steubenville

Abstract:
Histone deacetylases catalyze the removal of acetyl moieties from ε-amino groups of lysine in histone tails, which strengthens DNA-histone interactions. Sirtuin-2 (SirT2) proteins are one group of histone deacetylases. SirT2 converts NAD+ and acetyllysine to lysine, 2’-O-acetyl-ADP-ribose and nicotinamide. This action mediates transcriptional silencing, DNA repair, genome stability, and cell longevity, among others. This undergraduate research project compared 130 amino acid sequences from a range of organisms for sirtuin-2, as well as sirtuins 3 and 5, in order to determine their structural, functional, and phylogenetic relationships. Twenty-seven invariant residues were identified, including four cysteines that coordinate a structural zinc atom. Not surprisingly, most conserved residues cluster within the active site. Nine invariant residues, as well as some other highly conserved residues, coordinate the NAD+ cofactor. The invariant Asn-116 and His-135 both participate in the catalytic mechanism. The hydrophobic pocket involved in peptide binding is also highly conserved. Overall tertiary structures for sirtuins 2, 3 and 5 are also strongly homologous. Pattern analysis revealed that most conserved sequence motifs were found in all three sirtuin types, except that sirtuin-5 enzymes have a unique motif pattern in place of motifs 1 and 3, which lie outside of the active site. Phylogenetic analysis reveals that sirtuin-5 is more closely related to fungal sirtuin-2, whereas sirtuin-3 is more closely related to animal sirtuin-2. Group entropy analysis indicated several compensatory mutations in sirtuin-2, several of which are close to the zinc-binding site.


B52:
A Temporal Database Mediator for Gene Expression Data

Subject: Databases & Ontologies

Presenting Author: Guenter Tusch, Grand Valley State University

Author(s):
Yuka Kutsumi, Grand Valley State University, United States
Olvi Tole, Grand Valley State University, United States
Mary Ellen Hoinski, Grand Valley State University, United States

Abstract:
PURPOSE: Molecular biological research is often based on measurements that have been obtained at different points in time. The biologist looks at these values not as individual points, but as a progression over time. Our program (SPOT) helps the researcher find temporal patterns in large sets of microarray data.
PROCEDURES: Temporal data maintenance is concerned with the storage and retrieval of temporal data, while temporal reasoning focuses on the use of temporal data for decision support. A temporal (database) mediator is a computer program that allows for integration of the two tasks. Critical to both areas is temporal modeling. One framework for temporal modeling is the knowledge-based temporal abstraction (KBTA) framework. Temporal abstractions (TAs) enable the conversion of quantitative data to an interval-based qualitative representation. Examples of temporal abstractions are trend TAs, which capture increasing or decreasing values in a time series, and state TAs, which correspond to low, high, or normal values in the time series.
OUTCOME: We created a web-based temporal database mediator using open-source platforms that support the R statistical package, PHP, Bioconductor, and Web 2.0 knowledge representation standards using the open-source Semantic Web tool Protégé-OWL. We report here on use of the web interface and the challenges of using public databases.
IMPACT: Analysis of temporal gene expression data presents a novel opportunity to identify new drug targets and is one potential step to evaluate drugs for their overall effects.
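
To make the notion of state and trend TAs concrete, the following is a minimal sketch under stated assumptions. It is not the SPOT implementation and not a full KBTA system: the function names, the thresholds, the "steady" label, and the toy time series are hypothetical. Each function turns a list of (time, value) points into labeled intervals by merging consecutive points or steps that share a label.

```python
def state_abstraction(series, low, high):
    """Label each point low/normal/high and merge equal neighbours into intervals.

    series: list of (time, value) sorted by time.
    Returns a list of (start_time, end_time, label).
    """
    intervals = []
    for t, v in series:
        label = "low" if v < low else "high" if v > high else "normal"
        if intervals and intervals[-1][2] == label:
            # Extend the current interval to this time point.
            intervals[-1] = (intervals[-1][0], t, label)
        else:
            intervals.append((t, t, label))
    return intervals


def trend_abstraction(series, tolerance=0.0):
    """Label each step between consecutive points and merge equal neighbours."""
    intervals = []
    for (t0, v0), (t1, v1) in zip(series, series[1:]):
        delta = v1 - v0
        if delta > tolerance:
            label = "increasing"
        elif delta < -tolerance:
            label = "decreasing"
        else:
            label = "steady"
        if intervals and intervals[-1][2] == label:
            intervals[-1] = (intervals[-1][0], t1, label)
        else:
            intervals.append((t0, t1, label))
    return intervals


# Toy expression profile: (hours, expression value); values are made up.
profile = [(0, 1.0), (2, 2.5), (4, 4.0), (8, 3.0), (12, 0.5)]
print(state_abstraction(profile, low=1.5, high=3.5))
print(trend_abstraction(profile))
```

A temporal pattern query can then be phrased over these intervals (for example, "increasing for at least four hours, followed by decreasing") rather than over the raw values.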


B53:
Conceptualization of molecular findings by mining gene annotations

Subject: Databases & Ontologies

Presenting Author: Vicky Chen, University of Pittsburgh

Abstract:
Background
Contemporary genome-scale studies often return a long list of genes of potential interest, e.g., differentially expressed genes in a cancer tumor when compared to normal tissue. A challenging task is to reveal the major biological processes in which the genes are involved. The Gene Ontology (GO) is a controlled vocabulary representing molecular biology concepts associated with genes and gene products. However, current gene annotations available from the GO Consortium tend to be highly specific, which makes them ill-suited for revealing the major functional themes of a long gene list.
Methods
In this study, we utilized the hierarchical organization of the GO to derive a more abstract representation of the major biological processes of a list of genes. We cast the task as follows: given a list of genes, identify non-disjoint, functionally coherent subsets, such that the functions of a subset are represented by informative GO terms that capture the semantic information of the original annotations as much as possible.
Results
We evaluated different metrics for assessing information loss when merging GO terms, and different statistical schemes to assess the functional coherence of a set of genes. We found that the best discriminative power was achieved by using a combination of the IC-based measure as the information loss metric, and the statistics derived from the length of the Steiner tree connecting genes in an augmented GO graph.
Conclusions
Our results indicate that our methods provide an objective and quantitative approach to capturing the major directions of gene functions in a context-specific fashion.


B54:
Drug-Protein Searchable Website

Subject: Databases & Ontologies

Presenting Author: Katelyn Colter, University of Michigan

Abstract:
The aim of the project was to develop a database in which users can find out which drugs target which proteins and vice versa. Existing databases are not user friendly and provide extensive irrelevant information. Our database will make it easier to find these drug and protein target links, while still providing links back to the original data. In order to do this, we extracted a series of data from one existing database into two Excel spreadsheets. These data were compiled into a MySQL database, which was linked to a website. The majority of my work on this project was creating the website. The site has a search engine, so you can type in either a drug or a gene target and you will receive the targets or the drugs treating the targets, respectively. By doing this, I have learned that many different drugs can treat the same target and that the same drug can treat many different targets. As a result of making this database, we may be able to take targets that do not currently have drugs treating them and find drugs to treat them. We have also made it easier for the public to learn about these drug-target relationships. In the future, this database could play a key role in helping scientists discover new drug/gene target interactions.


B55:
CollecTF: a database of experimentally-validated transcription factor binding motifs in bacteria

Subject: Databases & Ontologies

Presenting Author: Ivan Erill, University of Maryland Baltimore County

Author(s):
Elliot White, University of Maryland Baltimore County, United States
Sefa Kiliç, University of Maryland Baltimore County, United States
Joseph Cornish, University of Maryland Baltimore County, United States

Abstract:
The computational analysis of transcriptional regulation in bacteria is limited by the paucity of experimental data on transcription factor-binding sites (TFBS). Even though several databases for prokaryotic TFBS exist, they typically focus on model organisms and/or combine in silico predictions with experimental knowledge, hindering the ability to leverage them in comparative genomics analyses and in the benchmarking of computational tools. Here we introduce CollecTF, a database of prokaryotic TFBS that aims at providing broader coverage than current databases by adopting a more open and flexible system. The primary intent of CollecTF is to provide a curated set of experimentally verified TFBS and, hence, the main emphasis is on a systematic and detailed curation of experimental procedures. Data on CollecTF is curated by a dedicated team using a standardized curation process, but the database encourages direct submission by experimentalists. To prevent the “data tomb” effect, data validated by CollecTF will be automatically submitted to the NCBI, where it will be available in GenBank format. The database provides advanced search and browse capabilities, such as the capacity to retrieve TFBS by specifying individual sources of evidence, clades or regulated operons, as well as dedicated processing of TF-binding motifs and online computation of motif statistics. The current CollecTF release includes data from over 200 publications and XXX different transcription factors, showcasing a comprehensive species-wide mapping of the regulon for four major transcription factors (LexA, Fur, Crp and FNR), as well as the first in-depth coverage of transcriptional regulatory networks in the Vibrio genus.


B56:
PlantSecKB: the Plant Secretome Knowledgebase

Subject: Databases & Ontologies

Presenting Author: Xiangjia Min, Youngstown State University

Author(s):
Gengkon Lum, University of Pittsburgh, United States
Jessica Orr, Youngstown State University, United States

Abstract:
PlantSecKB provides a resource of information for all secreted proteins, i.e. secretomes, and proteins located in other subcellular locations for plants. The database is constructed with all the available plant protein data from the UniProt database and predicted plant protein sequences from EST data assembled by the PlantGDB project. The database contains information from three sources: (1) predictions by computational tools including SignalP, TargetP, Phobius, TMHMM, PS-Scan, and FragAnchor; (2) subcellular locations that were curated or computationally predicted in UniProtKB; (3) subcellular locations that were curated by our curators from recent literature. The data can be searched by UniProt accession number, keyword, and species, and downloaded as a FASTA file. BLAST is available to allow users to query the database based on protein sequences. A tool is also implemented to support community annotation of subcellular locations of plant proteins. This database aims to facilitate plant secretome research and is available at http://proteomics.ysu.edu/secretomes/plant.html.


B57:
A Systematic Co-Expression and Co-Conservation Study of Chromatin Modification Complexes in Eukaryotes

Subject: Evolutionary, Comparative & Meta-Genomics

Presenting Author: Xuejian Xiong, Hospital for Sick Children

Author(s):
John Parkinson, Hospital for Sick Children, Canada

Abstract:
Chromatin modification (CM) complexes play a critical role in many cellular processes, such as regulating transcription, DNA replication, repair and recombination. Previous studies suggest that chromatin structure likely has an impact on the co-expression of closely located or interacting genes. In addition, many successful functional studies based on gene expression profiling have led to the perception that co-expression is likely to imply functional association. However, the extent to which CM complexes require co-expression among their subunits is not clear, and whether functionally associated CM genes tend to be co-expressed and co-conserved across eukaryotes remains unknown. In this paper, we focus on a set of well-validated CM complexes in yeast provided by CYC2008. Using several well-known large-scale microarray expression data sets, high-confidence protein-protein interaction data sets from iRefWeb, and comprehensive phylogenetic profiles from Phylopro, we studied the co-expression, co-interaction and co-conservation correlations between genes within each CM complex (i.e. intra-complex) and among different CM complexes (i.e. inter-complex). We found that interacting CM genes are, in general, statistically significantly co-expressed and co-conserved, and that CM complexes have statistically higher co-expression and co-interaction at both the intra-complex and inter-complex levels. However, CM complexes have lower co-conservation than random CM complexes, indicating that most CM complexes contain a mix of conserved and species-specific genes. We also studied CM complexes in detail using expression and conservation heatmaps and hybrid networks, which reveal that CM complexes that are not significantly co-expressed consist of sub-complexes, and that CM complexes that are not significantly co-conserved include many species-specific genes.


B58:
What do 1000 genomes tell us about Biased Gene Conversion theory

Subject: Personalized Genetics & Genomics

Presenting Author: Arnab Saha Mandal, The University of Toledo

Author(s):
Shuhao Qiu, The University of Toledo, United States
Xi Cheng, The University of Toledo, United States
Alexei Fedorov, The University of Toledo, United States

Abstract:
Using computational approaches, we mapped the SNPs from the 1000 Genomes datasets onto three human genes: HMOX1, AGT and ZMAT5. More than half of these SNPs were found to be rare, with frequencies of less than 1%. Linkage disequilibrium was determined for small subgroups of SNPs with higher frequencies (>10%). This analysis revealed major haplotype groups (3 in HMOX1, 5 in AGT and 5 in ZMAT5) that were mutually exclusive, stable, and characterized by strong pairwise correlations (R2 > 0.8) between any two SNPs within the same group. While the vast majority of the alleles within the population (>90%) could be associated with a particular major haplotype group, a small minority exhibited cross-mixing of haplotypes due to recombination events. Special cases among “1000 Genomes” were examined for possible local recombinations, where a span of 1-5 kb of a haplotype allele could be inserted in the middle of another, mutually exclusive haplotype allele. Such short-scale, unidirectional exchanges of genetic material may be explained by the formation of DNA heteroduplexes, which form the crux of Biased Gene Conversion (BGC) theory. Our investigations of the aforementioned genes revealed a total of 56 cases involving heteroduplexes. Of these, 28 corresponded to conversion of G/C pairs to A/T pairs and 20 corresponded to conversion of A/T pairs to G/C pairs. We did not find strong evidence in support of the BGC theory's claim that heteroduplexes may cause a significant shift toward GC-richness.
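
The pairwise correlation used above to define haplotype groups can be illustrated with the standard squared-correlation measure of linkage disequilibrium for two biallelic SNPs, computed from phased haplotypes. This is a minimal sketch, not the authors' code: the function name, the 0/1 allele encoding, and the toy haplotypes are assumptions.

```python
def ld_r_squared(haplotypes):
    """Squared correlation (r^2) between two biallelic SNPs.

    haplotypes: list of (a, b) pairs, one per phased chromosome, where
    a and b are 0 or 1 (reference/alternate allele at SNP A and SNP B).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


# Toy samples: alleles that always travel together vs. alleles that are independent.
linked = [(1, 1)] * 4 + [(0, 0)] * 6
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(ld_r_squared(linked))    # close to 1
print(ld_r_squared(unlinked))  # 0
```

Under this measure, two SNPs would fall in the same group by the criterion quoted above when the returned value exceeds 0.8.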


B59:
An Informatics Reverse Engineering for Cancer Therapy

Subject: Algorithm Development & Machine Learning

Presenting Author: Jianwei Sun, Marca Institute of Biotechnology Shanghai

Abstract:


B60:
BiCluE - Exact and Heuristic Algorithms for Weighted Bi-cluster Editing of Biomedical Data

Subject: Algorithm Development & Machine Learning

Presenting Author: Peng Sun, Max Planck Institute for Informatics

Abstract:


B61:
MicroRNA identification using linear dimensionality reduction with explicit feature mapping

Subject: RNA & Protein

Presenting Author: Luis Rueda, University of Windsor

Abstract:


B62:
Automatic analysis of personal genomes for clinical advisors

Subject: Personalized Genetics & Genomics

Presenting Author: Guy Zinman, Carnegie Mellon University

Abstract:
The next generation of medicine is envisioned to tailor treatments specifically to patients based on their unique genetic profile and lifestyle. With the rapidly decreasing cost of DNA sequencing and the large investments of medical institutes in digitizing medical records, this vision is almost within reach. This new revolution requires the integration of multiple streams of data from different sources - whole patient genome, condition-specific genomic measurements, clinical data from the medical record, and other information including clinical and medical literature - to diagnose diseases, identify relevant treatment options, and monitor response to therapies. Making use of this wealth of information requires automated analysis to support physicians in making clinical decisions.
In order to address these challenges, we recently launched an initiative dubbed ‘Doctor in a Box’ to build an open-source framework that will utilize machine learning algorithms that can incorporate genetic and clinical phenotypes to model the relationships between complex diseases and genome variation, identify a patient's susceptibility to disease, and predict which therapies might be most effective or cause the fewest side effects.


B63:
Why MD Equilibrated Protein Structures Are Different Than Experimentally Determined: A Thermodynamic Insight

Subject: Chemical Biology

Presenting Author: Filippo Pullara, University of Pittsburgh

Author(s):
Mert Gur, University of Pittsburgh, United States
Wenzhi Mao, University of Pittsburgh, United States
Ivet Bahar, University of Pittsburgh, United States

Abstract:
Anyone who has performed conventional molecular dynamics (MD) at least once will have observed that conformers diverge structurally from their starting X-ray crystal structures, sometimes even drastically. Several studies have noted such structural differences and attempted to explain this behavior and justify the necessity of MD equilibration. Although these studies have provided important insights, a detailed explanation based on fundamental physics, followed by validation on a large ensemble of protein structures, is still missing.

In this study we first provide insight into the radically different thermodynamic conditions of crystallization solutions and MD simulation environments. Crystallization conditions can involve unphysiologically high ion concentrations, low temperatures, and crystal packing with strong, specific protein-protein interactions that are not present under physiological conditions. These differences affect protein conformations and functions, so equilibrated MD structures are expected to differ from their X-ray structures.

To validate this claim we performed conventional MD simulations for 70 different proteins. The RMSD between the crystal and MD structures ranged from 1.2 to 5.0 Å after 10 ns of simulation and reached up to 14 Å after 100 ns. Our analysis shows that X-ray structures are good starting points but do not perfectly represent physiological conditions. This fact has to be taken into consideration when any kind of computational method, such as docking, is used to guide experimental analysis.
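
For readers unfamiliar with the metric, the RMSD between two conformations with N matched atoms is the square root of the mean squared distance between corresponding atoms. The sketch below is illustrative only and is not the authors' analysis code: the function name and the toy coordinates are hypothetical, and it assumes the two structures have already been superimposed (no optimal alignment is performed here).

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) in Å.

    Assumes both structures list the same atoms in the same order and are
    already superimposed; no rotation or translation fit is done.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy example: every atom displaced by 1 Å along z.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
md_frame = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
print(rmsd(crystal, md_frame))
```

In practice the MD frame is first fitted onto the crystal structure, since an unfitted RMSD also counts rigid-body drift of the whole protein.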

