ISMB/ECCB 2011 Posters

19th Annual International Conference on
Intelligent Systems for Molecular Biology and
10th European Conference on Computational Biology

Accepted Posters

Category 'M' - Machine Learning

Poster M01

Fast and Efficient Dynamic Nested Effects Models

Holger Fröhlich University of Bonn, Bonn-Aachen International Center for IT

Paurush Praveen (University of Bonn, Bonn-Aachen International Center for IT, Algorithmic Bioinformatics); Achim Tresch (Ludwig-Maximilians-University Muenchen, Gene Center Munich, Department for Chemistry and Biochemistry);

Short Abstract: Reverse engineering of biological networks is a key for the understanding
of biological systems. The exact knowledge of interdependencies between
proteins in the living cell is crucial for the identification of drug
targets for various diseases. However, due to the complexity of the
system a complete picture with detailed knowledge of the behavior
of individual proteins is still out of reach. Nonetheless, the advent
of gene perturbation techniques like RNA interference (RNAi),
opened new perspectives for network reconstruction by boosting the
ability to subject organisms to well defined interventions.

Nested Effects Models (NEMs; Markowetz et al., Bioinformatics, 2005) have been introduced as a statistical
approach to estimate the upstream signal flow from the downstream nested
subset structure of high-dimensional perturbation effects (measured e.g. on microarrays). The method was substantially
extended later on by a number of authors and successfully applied to various
datasets (Markowetz et al., Bioinformatics, 2005; Tresch & Markowetz, Stat. Appl. Genome Biol., 2007; Froehlich et al., BMC Bioinformatics 2007; Froehlich et al., Bioinformatics, 2008; Froehlich et al., Biometrical Journal, 2009; Zeller et al., EURASIP J. on Bioinf. and Syst. Biol. 2009; Anchang et al., PNAS, 2009). The connection of NEMs to Bayesian Networks and factor graph models has been highlighted (Zeller et al., EURASIP J. on Bioinf. and Syst. Biol. 2009; Vaske et al., PLOS Comp. Biol., 2009).
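The nested subset idea can be illustrated with a noise-free toy example. The sketch below is only an illustration of the subset principle, not the likelihood-based NEM inference used in the cited work; the gene names, effect sets and the function `nested_edges` are hypothetical. It proposes an edge from A to B whenever the effects observed after perturbing B are a strict subset of those observed after perturbing A.

```python
def nested_edges(effects):
    """Return (upstream, downstream) pairs whose effect sets are strictly nested.

    `effects` maps each perturbed gene to the set of downstream reporters
    that respond to its perturbation (hypothetical, noise-free toy data).
    """
    edges = []
    for a, eff_a in effects.items():
        for b, eff_b in effects.items():
            # Strict subset: everything affected by b is also affected by a.
            if a != b and eff_b < eff_a:
                edges.append((a, b))
    return sorted(edges)


toy_effects = {
    "A": {"g1", "g2", "g3"},
    "B": {"g2", "g3"},
    "C": {"g3"},
}
print(nested_edges(toy_effects))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Real perturbation data are noisy, which is why NEMs score candidate networks statistically rather than testing exact set inclusion.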

Here we introduce a computationally attractive extension of NEMs that enables the analysis of perturbation time series data (measured e.g. on microarrays). It thus complements the attempt of Anchang et al. (PNAS, 2009) to extend static NEMs to the modeling of perturbation time series measurements. Most importantly, this allows for the resolution of feedback loops in the signaling cascade, as well as for the discrimination of direct and indirect signaling. In contrast to Anchang et al., the key idea in our model is to unroll the signal flow over time. This allows for a computation showing some similarity to Dynamic Bayesian Networks and naturally extends the classical NEM formulation. Our model circumvents the need for time-consuming Gibbs sampling, which also makes it computationally attractive.

We performed extensive simulations of our model (also compared to a static NEM) to investigate its dependency on the length of time series, the sizes and architectures of the networks to be learned, and on the amount of available data. Our results indicate a very high specificity together with a good sensitivity of our method. The high specificity can be attributed to a special network structure prior favoring sparse networks here, but more generally could also incorporate prior beliefs on specific edges.

We applied our model to data investigating self-renewal in murine embryonic stem cell development (Ivanova et al., Nature, 2006). We found good agreement between our estimated network of 6 key proteins (5 transcription factors) and the biological literature. Moreover, our result generally agrees with the one previously published by Anchang et al., although it is sparser.

In summary we believe that our approach can serve as a useful tool to generate data driven hypotheses about signaling and/or transcriptional networks based on high-dimensional perturbation effects.

Poster M02

Prediction of bioluminescent proteins based on Discrete Wavelet Transform and Support Vector Machine

Krishna Kumar Kandaswamy University of Luebeck

Ganesan Pugalenthi (Genome Institute of Singapore, Laboratory of Structural Biochemistry); Mehrnaz Khodam Hazrati (University of Luebeck, Institute for Signal Processing); Thomas Martinetz (University of Luebeck, Institute for Neuro- and Bioinformatics);

Short Abstract: Background:

Bioluminescence is a process in which light is emitted by a living organism. Most creatures that emit light are sea creatures, but some insects, plants, fungi, etc. also emit light. The biotechnological application of bioluminescence has become routine and is considered essential for many medical and general technological advances. Identification of bioluminescent proteins is challenging due to their poor sequence similarity. So far, no specific method has been reported to identify bioluminescent proteins from primary sequence.

Results:
In this paper, we propose BLProt, a novel predictive method that couples the discrete wavelet transform (DWT) with a Support Vector Machine (SVM) based on physicochemical properties to predict bioluminescent proteins. BLProt was trained using a dataset consisting of 300 bioluminescent proteins and 300 non-bioluminescent proteins, and evaluated on an independent set of 141 bioluminescent proteins and 18202 non-bioluminescent proteins.

Conclusion:
BLProt achieved 73.66% accuracy in training (5-fold cross-validation) and 71.90% accuracy in testing. A multi-level decomposition using the discrete wavelet transform (DWT) coupled with an SVM is effective for identifying bioluminescent proteins. Our study shows that intelligent feature design and subsequent utilization of a robust and efficient machine learning algorithm for classification can lead to successful predictions of a wide variety of bioluminescent proteins.
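As a rough illustration of multi-level wavelet decomposition, the sketch below applies an unnormalised Haar transform to a numeric profile (for example, a physicochemical property value per residue). The abstract does not state which wavelet family or which properties BLProt uses, so the Haar wavelet and the function names `haar_step` and `haar_decompose` are assumptions made purely for illustration.

```python
def haar_step(signal):
    """One Haar level: pairwise means (approximation) and half-differences (detail)."""
    if len(signal) % 2:
        # Pad odd-length input by repeating the last value.
        signal = signal + signal[-1:]
    evens, odds = signal[0::2], signal[1::2]
    approx = [(a + b) / 2 for a, b in zip(evens, odds)]
    detail = [(a - b) / 2 for a, b in zip(evens, odds)]
    return approx, detail


def haar_decompose(signal, levels):
    """Repeatedly split the approximation; return (final approximation, details per level)."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2:
            break  # nothing left to split
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


approx, details = haar_decompose([1.0, 3.0, 5.0, 7.0], levels=2)
print(approx)   # [4.0]
print(details)  # [[-1.0, -1.0], [-2.0]]
```

Summary statistics of the coefficients at each level could then serve as fixed-length inputs to a classifier such as an SVM.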

Poster M03

A new approach for discovering knowledge from a large mass of data and optimisation of the numbers of variables before learning Bayesian Networks structure: Application in genetic studies of complex g

Heni Bouhamed University of Rouen

Short Abstract: The objective of our work is to develop a new approach for discovering knowledge from a large mass of data. The result of applying this approach will be an expert system that serves as a diagnostic tool for a phenomenon related to a huge information system. We first recall the general problem of learning Bayesian network structure from data and suggest a solution for reducing its complexity by applying data organization and optimization methods before learning the Bayesian network structure. We have applied our approach to biological data on hereditary complex illnesses for which the biological literature identifies the variables responsible for those diseases. Finally, we conclude with the limits reached by this work.

Poster M04

R-value: A new measure for dataset evaluation

Ariundelger Gantulga Dankook University

Sejong Oh (Dankook University, Nanobiomedical Science);

Short Abstract: Dataset evaluation is an important task for finding the best datasets for the design of new classifiers. In this paper, we propose a new dataset evaluation measure named R-value. This proposed measure is based on the ratio of overlapping areas among classes in a dataset. A high R-value for a dataset indicates that the dataset contains wide overlapping areas among its classes, and classification accuracy on the dataset may become low. Our study includes analysis results for several datasets using R-value and the correlation between R-value and the classification accuracy of each dataset. The data used in this paper were drawn from hypothetical experimental data for disease research including 200 people, and we also chose real datasets from the UCI machine learning repository. We ran experiments on the datasets with Naïve Bayes (NB), K-Nearest Neighbors (KNN), Artificial Neural Network (ANN), and Support Vector Machine (SVM) classifiers. From the experiments, we confirmed that the separability of a dataset is strongly related to the degree of overlap among classes in the dataset, and R-value successfully captures the overlap. We can use the R-value measure to understand the characteristics of a dataset, the feature selection process, and the proper design of new classifiers.

Poster M05

Transcription Regulation: models for combinatorial regulation and functional specificity

David Thomas University of Sussex

Melanie Newport (Brighton & Sussex Medical School, Global Health); Sue Jones (University of Sussex, Biochemistry);

Short Abstract: Gene regulation is partially controlled by transcription factor proteins that bind to specific DNA sequences known as transcription factor binding sites (TFBS). Combinations of transcription factors working co-operatively can play a role in regulating gene expression. Current algorithmic techniques cannot distinguish between functional and non-functional sites and tend to predict very large numbers of false positives.

In this work we have combined a series of factors that influence gene expression to create models for the identification of regulatory modules in genes with functional similarities. Such models have the potential to aid in the annotation of orphan genes of unknown function and provide a means of predicting the functional effect of the loss of transcription factor binding sites in specific systems. In an initial step we have used position weight matrices for 76 transcription factors from the JASPAR database and made predictions for TFBS in 35,773 Havana gene regions. We have then integrated these predictions with cellular differentiation information from the Gene Expression Atlas, structural information in the form of NuPoP nucleosome positioning predictions, and temporal information based on potential DNA methylation to create predictive models linking functionally related genes. The models are based on machine learning methods, including Hidden Markov Models. The best performing models will be applied to IFN-gamma genetic linkage data from a study of immune responses to BCG vaccination in humans, with the aim of identifying target genes linked to the IFN-gamma response.

Poster M06

Contrasting the Molecular Basis of Human and Mouse Embryonic Stem Cell Self-Renewal Using Bayesian Network Machine Learning Methods

Karen Dowell University of Maine

Allen K Simons (The Jackson Laboratory, Hibbs Lab); Zack Z Wang (Maine Medical Center Research Institute, Wang Lab); Matthew A Hibbs (The Jackson Laboratory, Hibbs Lab);

Short Abstract: Stem cells have long held tremendous promise as a versatile tool for biomedical research and disease therapy due to their ability to both self-renew and differentiate into distinct cell lineages. These unique cells play pivotal roles in many stages of organism development, tissue homeostasis, and repair. The molecular interactions that drive stem cell self-renewal are only partially understood. More than a dozen signaling pathways are implicated in self-renewal, suggesting regulation by a complex interplay of external signaling cues, transcriptional control, and molecular activities. Marked differences in self-renewal exist between mammalian species as well. Despite this inherent complexity, most models of self-renewal oversimplify the intricate dynamics associated with maintaining a cell lineage throughout development and adulthood. To further characterize the molecular foundations of mammalian embryonic stem cell (ESC) self-renewal, we developed a Bayesian network machine learning approach to integrate, analyze, and compare large collections of high-throughput mouse and human embryonic stem cell data. Our results confirm roles of genes and proteins known to be involved in ESC self-renewal, predict many novel players, and reveal commonalities and divergences between human and mouse ESCs. Computational evaluation shows our results are highly accurate and significantly improved over prior efforts, which overlook the functional importance of specific cell types in mammals. Laboratory validations of our novel predictions are underway. We are developing a comprehensive online resource and dynamic data visualization tool to provide the broader community access to our underlying ESC data and species-specific and comparative analyses.

Poster M07

A new feature selection method based on classes overlap

Jimin Lee Dankook University

Sejong Oh (Dankook University)

Short Abstract: Feature selection is one of the most important issues in classification. The goal of feature selection is to select relevant features and eliminate irrelevant ones in order to improve the performance of learning models and speed up the learning process.
In this study, we propose a new efficient feature selection method based on the R-value, a measure that captures the overlapping areas among classes in a feature. The motivation for using R-value is that the quality of a dataset has a profound effect on classification accuracy, and the overlapping areas among classes in a dataset are strongly related to its quality. A high R-value for a dataset indicates that it contains wide overlapping areas among its classes, and classification accuracy on the dataset may become low.
The R-value feature selection (RFS) method scores the overlapping areas of each candidate feature and then selects the features that have a low R-value. We tested RFS on 8 datasets with 3 classifiers and compared it with FSDD, ReliefF and MRMR. RFS achieved remarkable classification accuracy on some datasets. With KNN classification, RFS yields the best classification accuracy on most of the datasets, which suggests that it is a strong feature selection method. The RFS algorithm also runs in reasonable time on huge datasets.

Poster M08

Protein subnuclear localization using a spectrum kernel minimum distance probability method

Esteban Vegas University of Barcelona

Ferran Reverter (University of Barcelona, Statistics); Josep M. Oller (University of Barcelona, Statistics); Jose M. Elias (University of Barcelona, Statistics);

Short Abstract: In this article, we compare the performance of the kernel minimum distance probability method, a new kernel machine, with that of support vector machines (SVM) for predicting the subnuclear localization of a protein from its primary sequence. Both machines use the same type of kernel but differ in the criteria used to build the classifier. To measure the similarity between protein sequences we employ a k-spectrum kernel, which exploits the contextual information around an amino acid and the conserved motif information. We chose the Nuc-PLoc benchmark datasets to evaluate both methods. In most subnuclear locations our classifier has better overall accuracy than SVM. Moreover, our method has a lower computational cost than SVM because, in contrast to the SVM methodology, it avoids solving any optimization problem.
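For readers unfamiliar with the k-spectrum kernel, a minimal sketch follows. It computes the kernel as the inner product of k-mer count vectors, which is the standard definition; the function names and toy sequences are illustrative and not taken from the poster.

```python
from collections import Counter


def spectrum(seq, k):
    """Count every contiguous substring of length k (k-mer) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k):
    """Inner product of the two k-mer count vectors."""
    counts_s, counts_t = spectrum(s, k), spectrum(t, k)
    return sum(n * counts_t[kmer] for kmer, n in counts_s.items() if kmer in counts_t)


# "ABAB" contains AB twice and BA once; "BABA" contains BA twice and AB once.
print(spectrum_kernel("ABAB", "BABA", 2))  # 2*1 + 1*2 = 4
```

Because the kernel only needs k-mer counts, it can be evaluated without aligning the two sequences.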

Poster M09

SVM-based prediction of granzyme B cleavage sites

Lawrence Wee Institute for Infocomm Research

Joo Chuan Tong (Institute for Infocomm Research, Data Mining); Esmond Er (Institute for Infocomm Research, Data Mining); Fong Poh Ng (Singapore Immunology Network, Infection Group);

Short Abstract: Characterization of the protease degradome - the complete substrate repertoire of the protease in a cell, tissue or organism - will help unravel the intricacies of protease function. However, the experimental discovery and validation of bona fide protease substrates require time-consuming and laborious efforts. As such, computational methods for the prediction of protease substrates would be immensely helpful. Granzyme B is a serine protease which cleaves at unique tetrapeptide sequences and functions as a critical effector in various cellular processes such as apoptosis and inflammation. Here, we present a set of support vector machine (SVM) prediction models employing simple binary and position-specific Bayes feature representation schemes to predict granzyme B cleavage sites using windows of diverse lengths (4- to 24-mers). The models were rigorously trained and tested using 1160 unique, homology-reduced sequences (training dataset: 480 cleavage sites and 480 non-cleavage sites; testing dataset: 100 cleavage sites and 100 non-cleavage sites). The best model achieved an accuracy of 86.50% and AROC of 0.86, which is comparable to, if not better than, existing methods. In addition, we scanned the Chikungunya viral proteome for potential cleavage sites using our best model and found enrichment of cleavage sites spanning across key functional domains on the structural and non-structural viral proteins – suggesting novel host-viral interactions mediated through granzyme B activity. A server implementing the prediction models was developed and is freely accessible on the web.
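A simple binary feature representation is commonly implemented as one-hot encoding, where each residue in the window becomes a 20-element indicator vector. The sketch below assumes that reading; the alphabet ordering, the function name `binary_encode` and the example tetrapeptide are illustrative choices, not details given in the abstract.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def binary_encode(window):
    """Concatenate one 20-bit indicator vector per residue in the window."""
    vector = []
    for residue in window:
        bits = [0] * len(AMINO_ACIDS)
        # str.index raises ValueError for letters outside the 20-residue alphabet.
        bits[AMINO_ACIDS.index(residue)] = 1
        vector.extend(bits)
    return vector


features = binary_encode("IEPD")  # arbitrary example 4-mer window
print(len(features), sum(features))  # 80 4
```

A 24-mer window would give a 480-dimensional vector under the same scheme, so the feature length grows linearly with window size.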

Poster M10

Genome-wide enhancer discovery from chromatin methylation marks using Genetic Algorithm-optimized Support Vector Machines

Michael Fernandez Osaka University

Short Abstract: The recent discovery of the gene activatory effect of a broad domain of histone 3, lysine 4 mono-methylation (H3K4me1) combined with low amounts of tri-methylation (H3K4me3) at specific enhancers, has encouraged the systematic mapping of distal regulatory elements. Although a handful of well-performing computational models exist to identify putative enhancer regions using genome-wide chromatin modification maps, their practical application has been limited either by (i) falling short in the number of possible mark combinations that may define different enhancer classes; or (ii) by considering a large combination of marks that in practice is experimentally unviable. In this context, we have developed a method for chromatin state detection using Support Vector Machines in combination with Genetic Algorithm optimization (ChroGaSVM). Characteristic epigenetic profiles were computed by averaging epigenetic map reads in bins at windows of different sizes centered at putative enhancers. Background signals were built by centering profiles at random loci. ChroGaSVM estimates optimum window size and profile combinations to identify putative enhancers. Training with a small set of marks from ENCODE ChIP–chip data on untreated HeLa cells, ChroGaSVM using optimum profiles of 2.5 Kb windows recovered 90% of the experimentally supported enhancers in treated HeLa cells. Next, for a larger pool of 20 ChIP-Seq chromatin methylation libraries done in CD4+ T cells, ChroGaSVM successfully combined 2.5 Kb window-size profiles of five distinct methylation marks to predict ca. 25,000 experimentally supported enhancers with a positive predictive value of ~80%, thereby improving previous predictions on the same dataset by 30%.
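The profile computation described above (averaging map reads in bins within a window centred at a locus) can be sketched as follows. This is a minimal illustration under stated assumptions: the per-position coverage list, the function name `binned_profile` and its parameters are hypothetical, and ChroGaSVM's actual binning details are not given in the abstract.

```python
def binned_profile(coverage, center, window, n_bins):
    """Mean coverage in n_bins equal bins of a window centred at `center`.

    `coverage` is a list of per-position read counts (hypothetical input).
    Positions that fall outside the list are ignored.
    """
    start = center - window // 2
    bin_size = window // n_bins
    profile = []
    for b in range(n_bins):
        lo = start + b * bin_size
        values = [coverage[i] for i in range(lo, lo + bin_size) if 0 <= i < len(coverage)]
        profile.append(sum(values) / len(values) if values else 0.0)
    return profile


toy_coverage = [0, 0, 2, 2, 4, 4, 6, 6]
print(binned_profile(toy_coverage, center=4, window=8, n_bins=4))  # [0.0, 2.0, 4.0, 6.0]
```

Applying the same function at random loci would give background profiles of matching shape for comparison.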

Poster M11

Machine Learning Approaches to Identify and Characterize Effector Proteins in Pathogenic Bacteria

David Burstein Tel-Aviv University

Tal Zusman (Tel-Aviv University, Department of Molecular Microbiology and Biotechnology); Michael Pe'eri (Tel-Aviv University, Department of Cell Research and Immunology); Ziv Lifshitz (Tel-Aviv University, Department of Molecular Microbiology and Biotechnology); Michal Simovitch (Bar-Ilan University, Mina and Everard Goodman Faculty of Life Sciences); Dor Salomon (Tel-Aviv University, Department of Molecular Biology and Ecology of Plants); Guido Sessa (Tel-Aviv University, Department of Molecular Biology and Ecology of Plants); Ehud Banin (Tel-Aviv University, Department of Molecular Biology and Ecology of Plants); Gil Segal (Tel-Aviv University, Department of Molecular Microbiology and Biotechnology); Tal Pupko (Tel-Aviv University, Department of Cell Research and Immunology);

Short Abstract: Many pathogenic bacteria exert their function by translocating a set of proteins, termed effectors, into the cytoplasm of their host cell. These effectors subvert various host cell processes for the benefit of the bacteria. The primary goal of this study was to identify novel effectors on a genomic scale, towards a better understanding of the molecular mechanisms of bacterial pathogenesis. We have developed a computational approach for the detection of new effectors in the intracellular pathogen Legionella pneumophila, the causative agent of Legionnaires' disease, a severe pneumonia-like disease. Our approach is based on machine learning classification algorithms that are applied to a wide variety of features collected on a genomic scale. Applying this method, we have detected and experimentally validated dozens of new effectors. Notably, our computational predictions had a high accuracy rate of over 90%. Having a large pool of identified effectors, we are now utilizing a hidden semi-Markovian model to characterize the C-terminal signal required for effector translocation. Additionally, we are applying the machine learning scheme that we have developed for studying L. pneumophila effectors to identify pathogenic determinants in several other pathogens, including the human pathogen and potential bio-terrorism agent Coxiella burnetii, the plant pathogen Xanthomonas campestris, and Pseudomonas aeruginosa – the predominant respiratory pathogen in patients with cystic fibrosis (CF).

Poster M12

A method for risk prediction of a serious disease using rule-based analysis

Masakazu Sugiyama Graduate School of Information Science and Technology

Shigeto Seno (Graduate School of Information Science and Technology, Osaka University); Yoichi Takenaka (Graduate School of Information Science and Technology, Osaka University); Hideo Matsuda (Graduate School of Information Science and Technology, Osaka University);

Short Abstract: Some diseases are related to genes, and the number of diseases whose risk of onset can be predicted from genetic information has increased. In this study, we focused on a serious disease that affects only a few people but is very likely to cause severe symptoms. For such diseases, the false negative rate is more important than the false positive rate, so predicting the risk of the disease requires reducing the false negative rate. The usual methods are inadequate because they focus on overall accuracy rather than on the false negative rate. We propose a risk prediction method that focuses on the false negative rate using rule-based analysis.
Our method builds association rules on the gene markers of patients. The association rules produce no false negatives on the training data.
Our method judges the risk of a subject by the number of fulfilled rules. As the number of association rules without any false negatives on the training data is large, our method selects feasible rules and assigns them importance weights.
To evaluate our method, we used data on a serious disease (name withheld) comprising gene markers from 240 people, each consisting of 41 SNPs and 6 HLA markers. The results indicate that our method reduces the false negative rate and achieves higher overall accuracy than the usual methods.

Poster M13

Assessment of protein interaction prediction from sequence data using n-gram analysis

Guray Kuzu Koc University

Cengiz Ulubas (Koc University, Computational Science and Engineering); Ozlem Keskin (Koc University, Computational Science and Engineering); Attila Gursoy (Koc University, Computational Science and Engineering);

Short Abstract: Protein-protein interactions (PPIs) are crucial for many biological processes. There are various experimental methods to identify them, but computational approaches are also necessary to complement experimental methods. Many computational studies predict PPIs using only sequence data with the help of n-gram analysis, and they have reported quite successful results. Each group has used different parameters in n-gram analysis and applied their methods to different datasets. However, the parameters used in n-gram analysis affect the accuracy of prediction, and the selection of both positive (interacting) and negative (non-interacting) examples is crucial. In this study, we investigated parameters affecting the accuracy of prediction results and presented an assessment of different positive and negative PPI datasets. Classifications were performed using a support vector machine (SVM). We observed that analysis of 3-gram frequencies instead of any other n-lets or a combination of them gave better results. Using complete sequence instead of partial sequence, selecting a larger dataset and considering turn-around list of interactions increased the accuracy of results. However, different amino acid categorizations gave different results in classification. The method utilized to create the negative set is also important. Using a negative set based on non-colocalization instead of one created by random selection led to better predictions. It resulted in higher negative precision and positive sensitivity but lower negative sensitivity. Moreover, assessment of different datasets showed that models trained on one dataset performed better on a similar dataset, and validation across datasets did not achieve results as good as those within datasets.

Poster M14

Model selection for predictive and prognostic gene signatures

Miika Ahdesmaki Almac Diagnostics

Laura Hill (Almac Diagnostics); Nicolas Goffard (Almac Diagnostics); Fionnuala McDyer (Almac Diagnostics); Timothy Davison (Almac Diagnostics); Proutski Vitali (Almac Diagnostics); Max Bylesjo (Almac Diagnostics);

Short Abstract: Biomarker discovery involves identifying variables (e.g. genes) that are related to an endpoint of interest, for instance patient risk stratification or drug response. When used in classification setting, the biomarker discovery process is commonly aimed towards identifying a signature consisting of a panel of variables that together allow prediction of clinical outcome.

When signatures are developed according to the best practices there are often several models being evaluated giving similar classification performance. Having only information related to classification performance available for each model makes the model selection step difficult and in part arbitrary, as one model can rarely be said to be significantly better than another.

In this research we have explored the use of additional model properties to make more informed and relevant decisions in selecting the top ranking models by building additional metrics into the model generation step, such as biological relevance, clinical utility and analytical precision. The methodology is evaluated using gene expression data sets from the MAQC-II project. We have generated predictive signatures within cross-validation using several classifiers and feature selection methods and ranked them by their classification performance across signature lengths. The signatures were simultaneously analysed by functional enrichment, independence from known clinical covariates, permutation tests and analytical precision, also within cross-validation. Our proposed extended biomarker analysis gives considerably more insight into model properties that might otherwise be left unaccounted for and is crucial in selecting a signature that also generalises well and passes external validation.

Poster M15

Prioritization of Epigenetically Modified DNA Regions

Ernesto Iacucci KULeuven

Emina Tufekcic (KULeuven, ESAT/SCD); Yves Moreau (KULeuven, ESAT/SCD);

Short Abstract: DNA methylation and acetylation are two classes of epigenetic
modifications which play an important role in gene regulation. To
date, several modifications have been identified which are associated
with disease phenotypes.

Given candidate regions with measured DNA epigenetic modification (ChIP-chip data), associated features (size, TSS localization, associated gene expression, etc.), and qPCR-verified regions, we propose the use of machine learning techniques to prioritize their investigation.

Through an ongoing collaboration with a team of investigators who provide high-quality, high-throughput ChIP-chip data and follow-up qPCR validation, we have obtained results showing an improved ranking of our candidate list over a naive ChIP-chip enrichment score ranking.

We intend to continue applying additional machine learning and statistical techniques to prioritize modified regions. In this way we will contribute invaluable insight into the discovery of novel mechanisms of disease-gene regulation.

Poster M16

Dealing with Uncertainty in Support Vector Machines Prediction

Calin Voichita Wayne State University

Sorin Draghici (Wayne State University); Zhonghui Xu (Wayne State University, Computer Science); Roberto Romero (National Institute of Child Health and Human Development, NIH, Perinatology Research Branch);

Short Abstract: In the prediction stage, most classification techniques, including Support Vector Machines (SVM), do not take into consideration certain cases, such as the test data point being very close to the decision boundary or very dissimilar to the training data set, and will still assign the test instance to one of the classes. However, when a test instance is very close to the decision boundary, the side of the boundary on which the instance lies, and hence the predicted class, will depend more on the choice of training parameters than on a clear difference in features. Furthermore, if a test instance is substantially different from all the instances used during training, the classical SVM classifier will still assign it to a class although there is little evidence to support such an assignment. We propose the automatic detection of an uncertainty area in the prediction of a classifier. The prediction of test data points that fall into this region cannot be guaranteed and therefore will be marked as uncertain. We investigate a number of different techniques to detect uncertainty areas using the geometric margin of the SVM classifier, convex hulls in feature space and an automatic threshold on the posterior probability output. We assess the performance improvement of these techniques both on artificial data and on real data from the UCI machine learning repository. The classification accuracy improves compared to the classical SVM depending on the acceptable amount of uncertainty. We also analyze this dependency between the classification improvement and the amount of uncertainty.
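The margin-based variant of this idea can be sketched as a reject option on the classifier's signed decision value. The example below is a minimal illustration under assumptions: the decision values, labels, threshold and function names are hypothetical, and it does not reproduce the authors' automatic threshold selection or the convex hull technique.

```python
def classify_with_reject(decision_value, margin):
    """Label by sign, but abstain when the point lies within `margin` of the boundary."""
    if abs(decision_value) < margin:
        return "uncertain"
    return "positive" if decision_value > 0 else "negative"


def accuracy_and_rejection(decision_values, labels, margin):
    """Accuracy on accepted points and the fraction of points marked uncertain."""
    predictions = [classify_with_reject(v, margin) for v in decision_values]
    accepted = [(p, y) for p, y in zip(predictions, labels) if p != "uncertain"]
    accuracy = sum(p == y for p, y in accepted) / len(accepted) if accepted else 0.0
    rejected = 1 - len(accepted) / len(predictions)
    return accuracy, rejected


values = [2.0, 0.1, -1.5, -0.2]                         # hypothetical signed distances
truth = ["positive", "negative", "negative", "positive"]
print(accuracy_and_rejection(values, truth, margin=0.0))  # (0.5, 0.0)
print(accuracy_and_rejection(values, truth, margin=0.5))  # (1.0, 0.5)
```

In this toy case, widening the uncertainty area raises accuracy on the accepted points at the cost of leaving half of the points unclassified, which is the trade-off the abstract analyzes.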

Poster M17

Identification of SH2-peptide interactions using support vector machine

Kousik Kundu University of Freiburg

Fabrizio Costa (University of Freiburg, Computer Science); Rileen Sinha (Memorial Sloan-Kettering Cancer Center, Computational Biology); Michael Reth (Max Planck-Institute of Immunology, Department of Molecular Immunology); Michael Huber (RWTH Aachen University, Department of Biochemistry and Molecular Immunology); Rolf Backofen (University of Freiburg, Computer Science);

Short Abstract: Src homology 2 (SH2) domains are structurally conserved protein domains found in many intracellular signal-transducing proteins. Phosphorylation of tyrosine residues by tyrosine kinases is an important part of signal transduction. SH2 domains are the largest family of the peptide-recognition modules (PRMs) that recognize phosphotyrosine-containing peptides. Hence, these domains have a vital role in cellular signaling. Around 120 SH2 domains have been identified in 110 human proteins, and each SH2 domain binds a specific subset of peptides. Therefore, peptide motif recognition by specific SH2 domains is important for understanding their biological function. Currently only a few programs have been published for the prediction of SH2-peptide interactions, and most of them are based on position-specific weight matrices (PWMs), which do not model the dependencies between the amino acids. Furthermore, these tools either do not cover all human SH2 domains or are not publicly available, or both. In the current study we are developing a machine learning approach for the prediction of SH2-peptide interactions, which shall be made publicly available. An ideal way to study protein-protein interactions with a machine learning approach is to use high-throughput data. We used microarray and peptide array data to build positive and negative datasets. Our program selected important novel features and trained models on the data using those features. We built separate models for each SH2 domain (based on the available data). We measured performance in terms of the AUC (area under the ROC curve) with 10-fold cross-validation, and it is comparable to that of existing methods.

Poster M18

A Support Vector Machine approach for rapid genome-wide epistasis detection

Barbara Rakitsch Max Planck Institutes, Tuebingen

Limin Li (Max Planck Institutes, Tuebingen, Machine Learning & Computational Biology Research Group); Karsten Borgwardt (Max Planck Institutes, Tuebingen, Machine Learning & Computational Biology Research Group);

Short Abstract: Due to the combinatorial explosion of the space of candidate interactions, mapping phenotypes to groups of interacting loci in the genome is one of the most challenging computational aspects of genome-wide association studies.
Even detecting pairwise epistatic interactions between single nucleotide polymorphisms (SNPs) in a genome is practically infeasible, since an exhaustive search would have to consider on the order of 10^10 to 10^12 SNP pairs.

In this poster we propose a support vector machine approach to epistasis detection.
Our method allows us to rapidly search for pairs of SNPs that affect binary phenotypes by an epistatic interaction.
We show that our approach performs this task in a runtime which is subquadratic in the number of SNPs.
Our algorithm is therefore orders of magnitude faster than state-of-the-art methods for epistasis detection while achieving the same levels of power.
Due to its efficiency, our approach can easily handle datasets of the size of the human genome.
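
As a rough check on the search-space size quoted in this abstract, the number of unordered SNP pairs grows as n(n-1)/2. The snippet below is a minimal sketch; the SNP counts of 10^5 and 10^6 are illustrative assumptions (the abstract states only the resulting range), and the function name is hypothetical.

```python
from math import comb


def snp_pair_count(n_snps):
    """Number of unordered SNP pairs an exhaustive pairwise scan must test."""
    return comb(n_snps, 2)


# Illustrative SNP counts, assumed here for the sake of the arithmetic.
for n in (10**5, 10**6):
    print(f"{n:>9,} SNPs -> {snp_pair_count(n):.2e} pairs")
```

Under these assumed counts the totals fall within a small factor of the quoted range, and a tenfold increase in SNPs multiplies the number of pairs by about a hundred, which is why a subquadratic runtime matters at genome scale.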

Poster M19

Time series classifier model for miR-mRNA interaction

Jovan Rebolledo-Mendez University of Louisville

Nigel Cooper (University of Louisville, Anatomical Sciences and Neurobiology);

Short Abstract: This poster describes ongoing research on the expression of mRNAs and microRNAs (miRs) at different stages in the progression of an illness. The objective is to find a classification method that uses the expression of both the miRs and the mRNAs of an organism at different time points. The method takes the genes identified from the mRNA data, and finds and filters the target genes of the miRs, so that the related GO terms can be characterized. This permits the discovery of clusters based on the hierarchical structure of GO: the GO term with the highest rate of associated genes is selected and then grouped with other terms found within a relatively short distance of the weighted ones. The clusters are used as class discriminators for the corresponding subsets. Each discriminated GO term is mapped to a microarray that contains all the genes annotated to that term, with a series of samples taken over a given time period, producing one vector per discriminated GO term. The resulting datasets are correlated and used as inputs to train a neural network classifier. The trained network will then be tested for its ability to capture the pattern in the relationship between the miRs and mRNAs across the time points for a given illness, so that it can serve as a decision-making tool for identifying the illness in an organism.

Poster M20

mTiM: margin-based transcript mapping from RNA-seq

Nico Görnitz Technical University Berlin

Georg Zeller (European Molecular Biology Laboratory, Structural and Computational Biology Unit); Jonas Behr (Friedrich Miescher Laboratory of the Max Planck Society, AG Raetsch); Andre Kahles (Friedrich Miescher Laboratory of the Max Planck Society, AG-Raetsch); Soeren Sonnenburg (Technical University Berlin, Machine Learning); Gunnar Raetsch (Friedrich Miescher Laboratory of the Max Planck Society, AG-Raetsch); Pramod Mudrakarta (Friedrich Miescher Laboratory of the Max Planck Society, AG-Raetsch);

Short Abstract: Recent advances in high-throughput cDNA sequencing (RNA-seq) technology have made it a powerful tool for transcriptome studies. A pivotal step in the analysis of RNA-seq data is the accurate reconstruction of expressed transcripts.
Our machine learning-based transcript reconstruction method, which we call mTiM (Margin-based TranscrIpt Mapping), exploits features derived from RNA-seq read alignments and from computational splice site predictions to infer the exon-intron structure of the corresponding transcripts. In contrast to most gene finding systems, mTiM is strongly evidence-based and models only very few genic sequence motifs. Most importantly, mTiM is able to predict noncoding transcripts as well. Unlike purely alignment-based methods such as Cufflinks or Scripture, it can fill gaps in the read coverage, an advantage for predicting complete transcripts, in particular for weakly expressed genes.
We applied mTiM to strand-specific, paired-end Illumina RNA-seq data from C. elegans (2x76bp) where reads had been aligned to the genome with different methods. We evaluated the accuracy of mTiM's transcript predictions using annotated genes as a benchmark and compared these to transcripts reconstructed by Cufflinks.
These experiments show that mTiM's transcript reconstruction accuracy is as good as that of Cufflinks and much better than that of Scripture on carefully curated RNA-seq alignments. However, mTiM is considerably more tolerant of the alignment errors present in the results of widely used RNA-seq alignment tools, resulting in improved transcript predictions for most of the RNA-seq alignments analyzed.

Poster M21

A Computational Paradigm for More Specific TFBS Detection

Heike Sichtig University of Florida

Alberto Riva (University of Florida, Molecular Genetics and Microbiology);

Short Abstract: This poster is based on Proceedings Submission 173.
One of the key challenges of current computational biology is the construction of a model of the regulatory network of a cell. The identification of regulatory patterns in genomic DNA and their relation to specific transcription factors that bind to them is vital to understanding the regulatory infrastructure of a cell. Our paradigm is based on the combination of two biologically realistic information processing methods: third-generation artificial neural network models (spiking neural networks) are used to represent the complex structure of a binding site, while a genetic algorithm is used to optimize the network parameters during a learning phase. The networks are initially trained using known binding sites and negative examples, and are then used as classifiers to detect new TFBSs in genomic sequences. The goal of our work is to reduce the number of false positives in the predicted TFBSs, through a more accurate modeling of the information contained in the alignments that constitute the training data. We present the evaluation of a two-neuron network topology trained to represent TFBSs for four different transcription factors. The networks were trained using real TFBS data from the TRANSFAC, JASPAR and SCPD databases, and appropriately generated negative samples, and were compared against MAPPER, TFBIND and TFSEARCH. Our results show that our paradigm has the potential to attain very high classification accuracy, with a very small number of false positives.

Poster M22

Combined Approach to Anaphora Resolution for Disease-Determinants Ontology Learning from Web Sources

Jae-Hong Eom Max Planck Institute for Informatics

Amy Siu (Max Planck Institute for Informatics, Databases and Information Systems); Gerhard Weikum (Max Planck Institute for Informatics, Databases and Information Systems);

Short Abstract: Many Web information sources contain facts about diseases and their determinant factors, such as PubMed, a collection of research articles, or Wikipedia, a commonsense world knowledge base. Many approaches have recently been tried to distill valuable facts from these sources. Going one step further, a number of approaches have also been proposed recently to model the context of a fact in more detail using Semantic Web technologies.

However, a considerable number of facts are spread over more than one sentence in text sources, and these separated parts are normally connected semantically through anaphora. Such separated facts also reduce the coverage of many modern extraction approaches.

In this work, we propose a combined approach to anaphora resolution. Our method combines three different sub-approaches: top-down, bottom-up, and context propagation. In the top-down approach, we imitate human text-reading behavior to track the correct antecedent of a given anaphor. In the bottom-up approach, we consider the nearest n preceding sentences and use a ranking method to find the correct noun. Finally, global context is considered simultaneously by focusing on the main title of the page or of each section.

Evaluation shows that the proposed method increases both recall and precision. Using the proposed method, we built a disease-determinant ontology for human diseases containing more than 1 million facts.

Poster M23

Infinite Mixture Model Approach for Protein Function Prediction Algorithm Utilizing Hidden Markov Model and Bayesian Network Model with Dirichlet Process Prior

Takashi Kaburagi Gakushuin University

Yukihiro Koizumi (Waseda University, Graduate School of Advanced Science and Engineering); Go Kobayashi (Waseda University, Graduate School of Advanced Science and Engineering); Kousuke Oota (Waseda University, Graduate School of Advanced Science and Engineering); Yohei Nakada (Aoyama Gakuin University, School of Science and Engineering); Takashi Matsumoto (Waseda University, Faculty of Science and Engineering);

Short Abstract: In the postgenomic era, the importance of protein function prediction algorithms is growing. Although the number of known protein sequences and structures is rapidly growing, the functions of a large number of proteins remain unknown. For the past decade, a dynamic programming method, BLAST, has been one of the commonly used tools for discovering protein functions. The BLAST program compares protein sequences to those in sequence databases and calculates the statistical significance of the matches found. Hence, BLAST can be used to infer functional and evolutionary relationships between sequences. Some improvements to BLAST have been made using a hidden Markov model (HMM) approach.
Determining protein functions is a nontrivial task in bioinformatics because many synonyms exist for a given protein. To resolve this problem, the Gene Ontology (GO) project provides a controlled vocabulary of terms for the consistent representation of genes and gene product attributes across databases. In this study, among the three organizing principles of GO, we use molecular function terms to represent the function of a protein.
Here, we propose a novel algorithm to predict the functions of a protein, expressed as GO terms, given its amino acid sequence. The proposed algorithm utilizes a combination of two models. For modeling amino acid sequences, we use an HMM. For modeling the existence of GO terms, we use a Bayesian network model. Both models were extended to mixture models with a Dirichlet process prior. We performed a preliminary experiment using a small dataset.

Poster M24

Kinks in alpha-helical membrane proteins: Manual annotation, extensive analyses and successful prediction

Sabine Mueller University of Saarland

Benny Kneissl (Johannes Gutenberg-University of Mainz, Software-Techniques and Bioinformatics); Christofer S. Tautermann (Boehringer-Ingelheim Pharma GmbH & Co. KG, Lead Identification and Optimization support); Andreas Hildebrandt (Johannes Gutenberg-University of Mainz, Software-Techniques and Bioinformatics);

Short Abstract: The prediction of structural elements based on protein sequences is a major task in bioinformatics. Consequently, many algorithms dealing with secondary structure prediction have been developed. However, it is becoming increasingly apparent that distortions of perfect geometries in secondary structure elements provide important additional structural variety and that this variety is often crucial for understanding molecular functions. These distortions can be divided into different types, e.g. wide turns or kinks in helices. Kinks in particular, which noticeably change the helical axis, cause a significant change in the structure.
We generated a data set of 132 membrane proteins containing 1014 manually labeled helices and examined the environment of kinks. Our sequence analysis confirms the great relevance of proline and reveals disproportionately high occurrences of glycine and serine at kink positions. The structural analysis shows significantly different mean values of solvent-accessible surface area for kinked and non-kinked helices and demonstrates the influence of tertiary interactions on kinks. Furthermore, the data set was used to validate string kernels for support vector machines as a new kink prediction method. About 80% of all helices could be correctly predicted as kinked or non-kinked using this method. Given the high prediction accuracy for short sequences and the easy adaptation of the data set, we are confident that the most probable kink position can be identified in further studies.
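
To give a sense of what a string kernel computes, the sketch below implements a k-spectrum kernel, one common string kernel that counts shared substrings of length k. This is an assumption made for illustration: the abstract does not say which string kernel was used, and the example sequences and function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=2):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # Counter returns 0 for k-mers that do not occur in b.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


# Toy helix fragments (made up, not from the poster's data set).
helix_a = "LLPGAL"
helix_b = "ALPGLL"
similarity = spectrum_kernel(helix_a, helix_b, k=2)
```

A kernel of this kind can be supplied to a support vector machine as a precomputed similarity matrix over all pairs of helix sequences, so that classification works directly on sequences without a hand-built feature vector.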

Attention Poster Authors: Posters should be at most 1.30 m (130 cm) high x 0.90 m (90 cm) wide. Fasteners (Velcro / double-sided tape) will be provided on site; please DO NOT bring tape, tacks or pins.

Posters Display Schedule:

Odd Numbered posters:

Set-up timeframe: Sunday, July 17, 7:30 a.m. - 10:00 a.m.
Author poster presentations: Monday, July 18, 12:40 p.m. - 2:30 p.m.
Removal timeframe: Monday, July 18, 2:30 p.m. - 3:30 p.m.*

Even Numbered posters:

Set-up timeframe: Monday, July 18, 3:30 p.m. - 4:30 p.m.
Author poster presentations: Tuesday, July 19, 12:40 p.m. - 2:30 p.m.
Removal timeframe: Tuesday, July 19, 2:30 p.m. - 4:00 p.m.*

* Posters that are not removed by the designated time may be taken down by the organizers and discarded. Please be sure to remove your poster within the stated timeframe.

Delegate Posters Viewing Schedule

Odd Numbered posters:
On display Sunday, July 17, 10:00 a.m. through Monday, July 18, 2:30 p.m.
Author presentations will take place Monday, July 18: 12:40 p.m.-2:30 p.m.

Even Numbered posters:
On display Monday, July 18, 4:30 p.m. through Tuesday, July 19, 2:30 p.m.
Author presentations will take place Tuesday, July 19: 12:40 p.m.-2:30 p.m.

Want to print a poster in Vienna? Try these options:

Repacopy, next to the congress venue

Also at Karlsplatz: in the Ring Center, Kärntner Str. 42

If you need your poster on a thicker material, you may also use a plotter service next to Karlsplatz: http://schiessling.at/portfolio/
