20th Annual International Conference on
Intelligent Systems for Molecular Biology


Poster numbers will be assigned May 30th.
If you cannot find your poster below, you have probably not yet confirmed that you will be attending ISMB/ECCB 2015. To confirm your poster, find the poster acceptance email and click the confirmation link in it, then follow the instructions.

If you need further assistance please contact submissions@iscb.org and provide your poster title or submission ID.

Category M - 'Proteomics'
M02 - Gin-IM : From ISH images to Gene Interaction Networks
Short Abstract: Accurate prediction of gene interactions from gene expression data, especially in multicellular organisms such as Drosophila, requires temporal-spatial analysis of gene expressions. New image based techniques using in-situ hybridization (ISH) have recently been developed to allow large-scale spatial-temporal profiling of whole body mRNA expression. However, analysis of such data for discovering new gene interactions still remains an open challenge.

We present Gin-IM, an automatic system for learning gene interaction networks from Drosophila embryonic ISH images. Gin-IM extends recent work in learning sparse undirected graphical models to predict interactions between genes. By capturing the notion of spatial similarity of gene expression, while taking into account the presence of multiple images per gene via multi-instance kernels, Gin-IM predicts meaningful gene interaction networks.

Using both synthetic data and a small manually curated data set, we demonstrate the effectiveness of our approach in network building. Further, results are reported on a large publicly available collection of Drosophila embryonic ISH images from the Berkeley Drosophila Genome Project, where Gin-IM makes novel and interesting predictions of gene interactions. Contrasting the microarray network using temporal gene expressions with the image based network predicted by Gin-IM, we show that ISH based analysis using Gin-IM captures spatial patterning of gene expressions that the microarray based network is unable to detect.
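The multi-instance kernel idea in the M02 abstract can be illustrated with a small sketch. It assumes a normalized set kernel (the average of a base kernel over all image pairs) with an RBF base kernel; the abstract does not state which multi-instance kernel Gin-IM uses, and the function names and feature vectors below are hypothetical.

```python
from math import exp


def rbf_kernel(x, y, gamma=1.0):
    """RBF similarity between two image feature vectors (assumed base kernel)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def multi_instance_kernel(bag_a, bag_b, gamma=1.0):
    """Average base-kernel value over all pairs of images from two genes.

    Each bag holds the feature vectors of every image available for one gene,
    so genes with different numbers of images can still be compared.
    """
    total = sum(rbf_kernel(x, y, gamma) for x in bag_a for y in bag_b)
    return total / (len(bag_a) * len(bag_b))


# Hypothetical two-dimensional expression-pattern features, several images per gene.
gene_a = [[0.1, 0.9], [0.2, 0.8]]
gene_b = [[0.15, 0.85]]
gene_c = [[0.9, 0.1], [0.8, 0.2], [0.95, 0.05]]

print(multi_instance_kernel(gene_a, gene_b))  # spatially similar genes
print(multi_instance_kernel(gene_a, gene_c))  # spatially dissimilar genes
```

A gene-by-gene matrix of such values could then serve as the similarity input to a sparse graphical model estimator.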
M03 - Which Phenotypes Can be Predicted from a Genome Wide Scan of Single Nucleotide Polymorphisms (SNPs): Ethnicity vs. Breast Cancer
Short Abstract: Background
While genome wide association studies (GWASs) apply statistical tools to find the correlation of each SNP with the phenotype, we consider genome wide predictive studies (GWPSs), which learn classifiers that can predict a phenotype given a genome wide scan of germline SNPs. We compare the effectiveness of two such GWPSs, for predicting ethnicity and breast cancer.
First, we learned a model for predicting an individual’s ethnicity, as an ensemble of disjoint decision trees, from the 270 subjects of the International HapMap Project. This classifier, which involves only 149 SNPs, has a 10-fold cross-validation accuracy of 100% and an accuracy of 96.8% when tested on an independent (n=321) dataset.
Second, we applied a combination of feature selection (MeanDiff) and learning (KNN) methods to a dataset of 623 subjects (302 breast cancer cases and 321 controls). Our learning system produced a classifier based on the 500 top-ranked SNPs (0.1% of the SNPs), whose LOOCV accuracy was 59.55%, which is significantly better than the baseline accuracy (permutation test). Sensitivity analysis shows that our model is robust to the number of SNPs selected. External validation of our learning algorithm using 2287 subjects in the CGEMS breast cancer dataset showed 60.25% accuracy.
This study reveals that while SNPs perfectly determine an individual’s ethnicity, they provide only weak prediction of breast cancer susceptibility. This can be explained by breast cancer heterogeneity, environmental and lifestyle effects not represented in germline SNPs, the effect of other genomic changes (somatic mutations, copy number variations, structural changes, etc.), and the study’s limited sample size.
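The feature selection plus KNN pipeline with leave-one-out cross-validation described in the M03 abstract might look roughly like the sketch below. It assumes that MeanDiff ranks each SNP by the absolute difference between its mean genotype in cases and in controls, which the abstract does not spell out, and it uses a tiny made-up genotype matrix (0/1/2 minor-allele counts). Selection is repeated inside each fold so the held-out subject never influences which SNPs are chosen.

```python
from collections import Counter


def mean_diff_rank(X, y, top_k):
    """Return indices of the top_k features by |mean(cases) - mean(controls)|."""
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        mean_cases = sum(row[j] for row in cases) / len(cases)
        mean_controls = sum(row[j] for row in controls) / len(controls)
        scores.append(abs(mean_cases - mean_controls))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:top_k]


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def select(row, features):
    """Keep only the selected feature columns of one subject."""
    return [row[j] for j in features]


def loocv_accuracy(X, y, top_k, k):
    """Leave-one-out accuracy, re-running feature selection in every fold."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        features = mean_diff_rank(train_X, train_y, top_k)
        reduced = [select(row, features) for row in train_X]
        if knn_predict(reduced, train_y, select(X[i], features), k) == y[i]:
            correct += 1
    return correct / len(X)


# Made-up data: SNP 0 carries signal, SNPs 1 and 2 are noise.
X = [[2, 0, 1], [2, 1, 0], [2, 0, 1], [1, 1, 0],
     [0, 1, 1], [0, 0, 0], [0, 1, 1], [1, 0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

print(mean_diff_rank(X, y, top_k=1))
print(loocv_accuracy(X, y, top_k=1, k=3))
```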
M04 - Localizing genes to cerebellar layers using ISH images classification
Short Abstract: Many neural processes, including brain development, plasticity and activity, are controlled by transcription. Understanding these control processes is hard since the mammalian brain consists of numerous types of neurons and glia, and very little is known about which genes are expressed in which cells and brain layers. Here we describe an approach to detect genes whose expression is primarily localized to a specific layer of the mouse cerebellum. We learn typical spatial patterns of expression from a few markers that are known to be localized to specific layers, and use these patterns to predict localization for new genes. We analyze images of in-situ hybridization (ISH) experiments, which we represent using histograms of local binary patterns (LBP), and train image classifiers and gene classifiers for four layers of the cerebellum: the Purkinje, granular, molecular and white matter layers. On held-out data, the layer classifiers achieve accuracy above 96% (AUC) by representing each image at multiple scales and by combining multiple image scores into a single gene-level decision. When applied to the full mouse genome, the classifiers predict specific layer localization for hundreds of new genes in the Purkinje and granular layers. Many genes localized to the Purkinje layer are expressed in astrocytes, and involved in lipid metabolism, possibly due to the unusual size of Purkinje cells.
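As an illustration of the LBP histogram representation mentioned in the M04 abstract, the sketch below computes the basic 8-neighbour local binary pattern code for every interior pixel and returns a normalized 256-bin histogram. This is the textbook single-scale variant; the abstract does not say which LBP variant or scales were used, so the details here are assumptions.

```python
def lbp_histogram(image):
    """Normalized 256-bin histogram of basic 3x3 local binary pattern codes.

    `image` is a list of equal-length rows of grey levels. Each interior pixel
    gets an 8-bit code: one bit per neighbour, set when neighbour >= centre.
    """
    # Neighbour offsets, clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                if image[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Toy patches: a flat region and an isolated bright spot.
flat_patch = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
spot_patch = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]

print(lbp_histogram(flat_patch).index(1.0))
print(lbp_histogram(spot_patch).index(1.0))
```

Histograms like these, computed at several image scales, would be the feature vectors passed to the layer classifiers.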
M05 - Fishing for Virulent Factors: Machine Learning Predictions and Experimental Validations of Bacterial Effectors
Short Abstract: Numerous pathogenic bacteria exert their function by translocating a set of proteins, termed effectors, into the cytoplasm of their host cell. The primary goal of this study was to identify novel effectors on a genomic scale, towards a better understanding of the molecular mechanisms of bacterial pathogenesis. We applied a machine learning approach for the detection of effectors in the intracellular pathogen Legionella pneumophila, the causative agent of Legionnaires' disease, a severe pneumonia-like disease. Our approach is based on the combination of several classification algorithms trained on a variety of features collected on a genomic scale. We applied this methodology to predict and experimentally validate dozens of new effectors. Notably, our computational predictions had a high accuracy rate of over 90%. Having a large pool of identified effectors, we studied the signals that enable the secretion of effectors. We have implemented a hidden semi-Markov model (HSMM) to characterize regions that are recognized by the bacterial secretion machinery. Using the HSMM we were able to detect novel effectors in different species of Legionella, as well as in Coxiella burnetii, an extremely infectious pathogen and a potential bio-terrorism agent. Based on the HSMM we were able to synthesize, for the first time, an artificial secretion signal, and experimentally prove its translocation. Furthermore, we are using similar machine learning approaches to identify pathogenic determinants in several other pathogens, including the food-borne Salmonella enterica, the plant pathogen Xanthomonas campestris, and Pseudomonas aeruginosa – the predominant respiratory pathogen in cystic fibrosis (CF) patients.
M06 - A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins.
Short Abstract:
Background: With advances in sequencing technology, the need to determine the function and organismal origin of unknown DNA or peptide fragments is increasing. There are many ways to approach this problem, but none has emerged as the best protocol. Here we propose a systematic way to determine organismal origin using a machine learning algorithm that has been used previously for similar problems, the Support Vector Machine (SVM).
Result: Because the amino acid compositions of proteobacterial proteins differ from those of plant host proteins, we developed an SVM model based on amino acid and dipeptide compositions to distinguish between a proteobacterial protein and a plant host protein. The amino acid composition (AAC) based SVM model had an accuracy of 92.86% with 0.86 MCC (Matthews correlation coefficient), while the dipeptide composition (DC) based SVM model had a maximum accuracy of 95.39% and 0.91 MCC. We also developed SVM models based on split amino acid composition (SAAC) and a hybrid model (AAC and DC), which gave maximum accuracies and MCC of 94.32%, 0.89 and 96.30%, 0.93 respectively.
Conclusion: To test the validity of the models, they were evaluated on unseen datasets not used in training, on which they achieved more than 90% accuracy. These models will help to distinguish a proteobacterial sequence from a plant host sequence.
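The feature encodings and the evaluation metric named in the M06 abstract are standard and can be sketched briefly. The code below computes amino acid composition (20 values), dipeptide composition (400 values) and the Matthews correlation coefficient from a confusion matrix. It covers only the feature and metric side, not the SVM training, and the example sequence is made up. Residues outside the 20 standard letters are ignored in the counts but still included in the length, which is an assumption.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """20-dimensional vector: fraction of each standard residue in the sequence."""
    seq = seq.upper()
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """400-dimensional vector: fraction of each overlapping residue pair."""
    seq = seq.upper()
    counts = {}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    total = len(seq) - 1
    return [counts.get(a + b, 0) / total for a, b in product(AMINO_ACIDS, repeat=2)]


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 when it is undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


example = "ACAC"  # made-up peptide
aac = amino_acid_composition(example)
dc = dipeptide_composition(example)
print(len(aac), len(dc))
print(matthews_corrcoef(tp=50, tn=50, fp=0, fn=0))
```

A hybrid model of the kind described would concatenate the 20 AAC values with the 400 DC values into a single 420-dimensional input vector.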
M07 - Pathway-Informed Learning for Cancer-Related Prediction Tasks
Short Abstract: We propose a novel method that directly incorporates information from gene-interaction networks to predict cancer outcomes from high-throughput data sets. The method addresses “curse of dimensionality” concerns in modern cancer data sets that contain tens of thousands of genomic and epi-genomic measurements on genes for only a few hundred samples.

Typically, a model learned from such data will fail to generalize to new, previously unseen examples. The failure to generalize further raises a question as to the biological relevance of the features deemed important by the model since feature significance becomes inflated due to the severe under-sampling of the feature space. Today’s methods address biological significance through post-processing -- by placing discriminative scores onto a gene interaction network and then searching for saturated sub-networks -- with mixed results.

We introduce a method that encodes pathway information as a collection of basis vectors. The method projects the data onto these bases and trains predictive models using the transformed features, which now represent the activity of biological pathways. Training predictors in the lower dimensional space reduces overfitting while the resulting projection aggregates the genes into meaningful functional groups, which allows for immediate biological interpretation of the results.

We apply our approach to predicting tumor type and drug sensitivity in a panel of breast cancer cell lines as well as human tumor samples from The Cancer Genome Atlas (TCGA). Our results demonstrate that pathway-informed learning is able to identify relevant biological features for the two prediction tasks, often leading to higher accuracy and better interpretability than the traditional gene-by-gene analysis.
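The projection step in the M07 abstract can be sketched under a simple assumption: each pathway is encoded as a unit-length indicator vector over its member genes, and a sample's pathway activity is its dot product with that vector. The abstract does not say how the basis vectors are actually constructed, and the pathway names and gene sets below are toy examples, not curated annotations.

```python
from math import sqrt


def pathway_basis(genes, pathways):
    """Build one unit-length indicator vector per pathway over the gene list."""
    basis = {}
    for name, members in pathways.items():
        vec = [1.0 if gene in members else 0.0 for gene in genes]
        norm = sqrt(sum(vec))
        basis[name] = [v / norm for v in vec] if norm else vec
    return basis


def project(sample, basis):
    """Map a gene-level measurement vector to pathway-level features."""
    return {
        name: sum(x * b for x, b in zip(sample, vec))
        for name, vec in basis.items()
    }


genes = ["TP53", "MDM2", "EGFR", "KRAS"]
# Toy gene sets for illustration only.
pathways = {
    "p53_signaling": {"TP53", "MDM2"},
    "rtk_ras": {"EGFR", "KRAS"},
}

basis = pathway_basis(genes, pathways)
sample = [2.0, 2.0, 0.0, 1.0]  # made-up expression values, one per gene
features = project(sample, basis)
print(features)
```

A downstream classifier would then be trained on the pathway-level features, which number far fewer than the original gene-level measurements.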
M08 - Knowledge-based identification of trans-acting small RNAs in the Escherichia coli genome
Short Abstract: Small non-coding RNA in bacteria offers additional layers of metabolic regulation, either post-translationally or post-transcriptionally. In the latter case, the small RNA molecule base pairs with regions in the target mRNA, followed by degradation of the duplex. Trans-acting small RNAs are a class of ncRNAs that form weak base-pairing with their target mRNAs, often with the aid of an RNA chaperone molecule, Hfq. The past decade has seen rapid discovery of several small non-coding RNAs in bacteria, particularly Escherichia coli, owing to advances in high throughput techniques such as genome-wide high-density tiling arrays and deep sequencing of the transcriptome (RNAseq). As the appreciation of the prevalence of this regulation layer grows, identifying the targets and biological function of novel non-coding RNAs still remains a major challenge.
We attempt to address this issue by employing a modified version of a popular machine-learning algorithm, Random Forest, trained on an experimentally verified training set for the classification of true interacting sRNA-mRNA pairs versus non-interacting pairs. Modifying Random Forest into a class-balanced learner and applying the algorithm’s proximity metric to reduce the number of false positives make the method amenable to genome-wide predictions. Predicted positive interactions are being experimentally validated.
M09 - Active learning to associate semantically related phenotypes across genetic epidemiologic studies
Short Abstract: Many genetic epidemiologic studies of cardiovascular disease have multiple variables related to any given phenotype, resulting from different definitions and multiple measurements or subsets of data. A researcher searching such databases for the availability of phenotype and genotype combinations is confronted with a veritable mountain of variables to sift through. This often requires visiting multiple websites to gain additional information about variables that are listed on databases, and examination of data distributions to assess similarities across cohorts. While the naming strategy for genetic variants is largely standardized across studies (e.g. “rs” numbers for single nucleotide polymorphisms or SNPs), this is often not the case for phenotype variables. For a given study, there are often numerous versions of phenotypic variables. Manually mapping and harmonizing these phenotypes is a time-consuming process that may still miss the most appropriate variables. Previously, we developed a supervised learning algorithm that learns to determine whether a pair of phenotypes is in the same class. Though this algorithm achieved satisfactory F-scores, the need to manually label training examples is a bottleneck to improving its coverage. Herein we present a novel active learning solution to this challenging phenotype-mapping problem. Active learning queries users for the labels of those unlabeled phenotypes that may improve the mapping the most, and therefore reduces the labeling effort needed. Active learning will make phenotype mapping more efficient and improve its accuracy and, along with intuitive phenotype query tools, would provide a major resource for researchers utilizing these databases.
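The query step of active learning described in the M09 abstract can be illustrated with uncertainty sampling, one common strategy. The abstract does not say which query strategy the authors use, so this is only an assumed example: the current classifier's probability that two phenotype variables belong to the same class is taken as given, and the pairs closest to 0.5 are sent to the user for labeling. The variable names are hypothetical.

```python
def most_uncertain(pair_probs, n_queries=1):
    """Return the phenotype pairs whose predicted P(same class) is closest to 0.5.

    `pair_probs` maps a (variable_a, variable_b) tuple to the current model's
    probability that the two variables describe the same phenotype.
    """
    ranked = sorted(pair_probs, key=lambda pair: abs(pair_probs[pair] - 0.5))
    return ranked[:n_queries]


# Hypothetical variable names and model outputs.
pair_probs = {
    ("sbp_visit1", "systolic_bp"): 0.95,       # confidently the same
    ("bmi", "hdl_chol"): 0.03,                 # confidently different
    ("glucose_fasting", "fasting_glu"): 0.55,  # model is unsure
}

print(most_uncertain(pair_probs, n_queries=1))
```

After the user labels the returned pairs, they would be added to the training set and the classifier retrained before the next round of queries.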
M10 - Protein Function Prediction Algorithm Based on Infinite State Hidden Markov Model and Bayesian Network Model
Short Abstract: Protein function prediction algorithms have been rapidly gaining importance in the post-genomic era. BLAST, a dynamic programming method, has been widely used for discovering protein functions over the last decade. The BLAST program compares protein sequences to those in sequence databases and calculates the statistical significance of the matches found. A hidden Markov model (HMM) approach has been adopted to incorporate some improvements in the BLAST program. Such methods involve clustering the protein sequences on the basis of experimentally verified and manually curated function annotation databases.
Determining protein functions is a non-trivial task in bioinformatics because many synonyms exist for a given protein. To resolve this problem, the Gene Ontology (GO) project provides a controlled vocabulary set of terms for the consistent representation of genes and gene product attributes across databases. In this study, among the three organizing principles of GO, we use molecular functions as functions of a protein.
We propose an algorithm to predict the functions of a protein via GO terms, given its amino acid sequence. The proposed algorithm is based on a combination of two models. For modeling amino acid sequences, we use an HMM. For modeling the existence of GO terms, we use a Bayesian network model. One of the difficulties in adopting HMMs for real-world problems is that the number of hidden states should be carefully selected. Thus, we propose a non-parametric Bayesian approach to use an infinite number of states with a stick-breaking process prior. A preliminary experiment is conducted using a small dataset.
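The stick-breaking prior mentioned in the M10 abstract can be sketched as follows. Each step breaks off a Beta(1, alpha) fraction of the remaining stick, and the broken-off lengths serve as prior weights over hidden states. The sketch truncates the process at a fixed number of sticks for illustration; the truncation level, the concentration value and the function name are assumptions, not details from the abstract.

```python
import random


def stick_breaking_weights(alpha, n_sticks, rng):
    """Draw truncated stick-breaking weights.

    For each stick k, sample beta_k ~ Beta(1, alpha) and set
    pi_k = beta_k * prod_{j<k} (1 - beta_j). Returns the weights and the
    leftover stick length that the truncation leaves unassigned.
    """
    weights = []
    remaining = 1.0
    for _ in range(n_sticks):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    return weights, remaining


rng = random.Random(0)  # fixed seed so repeated runs give the same draw
weights, leftover = stick_breaking_weights(alpha=2.0, n_sticks=10, rng=rng)
print(weights)
print(leftover)
```

Smaller values of alpha concentrate the mass on the first few states, while larger values spread it over more states, which is how the prior lets the data decide how many hidden states are effectively used.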
M11 - Identifying master regulators of cancer and their downstream targets by integrating genomic and epigenomic aberrant features
Short Abstract: Motivation: Vast amounts of molecular data characterizing the genome, epigenome and transcriptome are becoming available for a variety of cancers. The current challenge is to integrate these diverse layers of molecular biology information to create a more comprehensive view of key biological processes underlying cancer. We developed a biocomputational algorithm that integrates copy number, DNA methylation, mutation and gene expression data to study master regulators of cancer and identify their targets. Our algorithm starts by generating a list of candidate driver genes based on the rationale that genes that are driven by multiple genomic events in a subset of samples are unlikely to be randomly deregulated. We then select the master regulators from the candidate driver and identify their targets by inferring the underlying regulatory network of gene expression.
Results: We applied our biocomputational algorithm to identify master regulators and their targets in glioblastoma multiforme (GBM) and serous ovarian cancer. Our results suggest that the expression of candidate drivers is more likely to be influenced by copy number variations than DNA methylation. Next, we selected the master regulators and identified their downstream targets using module networks analysis. As a proof-of-concept, we show that the GBM and ovarian cancer module networks recapitulate known processes in these cancers. In addition, we identify master regulators that have not been previously reported and suggest their likely role. In summary, focusing on genes whose expression can be explained by their genomic and epigenomic aberrations is a promising strategy to identify master regulators of cancer.
M12 - Automated Analysis of Immunohistochemical Images Identifies Candidate Location Biomarkers for Cancers
Short Abstract: Significant efforts have been made to identify candidate biomarkers that differ in expression between normal and cancerous tissues. Such biomarkers may be used for early detection, diagnosis, staging, prognosis, and patient-tailored therapy. In this study we seek to identify potential biomarkers that show differences in subcellular location instead of their quantitative expression. Our approach automatically compares the subcellular location of proteins between immunohistochemistry images of normal and cancerous tissues in order to identify proteins (referred to as “location biomarkers”) whose location distribution changes in the cancer state. The pipeline begins by selecting the subset of available images that pass a quality threshold. Next the images are spectrally unmixed into their hematoxylin and diamino-benzidine components, reflecting the distribution of DNA and protein respectively. Regions of each image are sampled and 650 features are calculated on each region to quantitatively describe staining patterns. The normal and cancer features for each image are compared by a nonparametric test to find proteins that significantly change subcellular location patterns. We have applied this pipeline to images from the Human Protein Atlas for 6000 proteins in 3 tissues. As a positive control we use gross subcellular annotations provided by the Atlas. We show that we can predict these positive controls as well as a new set of putative location markers with high probability.
M13 - A Flexible Integrative Approach Based On Random Forest Improves Prediction Of Eukaryotic Transcription Factor Binding Sites
Short Abstract: Transcription factor binding sites (TFBSs) are DNA sequences of 6 to 15 base pairs, and their interaction with their binding partners, the transcription factors (TFs), largely determines the observed spatiotemporal gene expression patterns. Here, we checked the extent to which sequence-based prediction of TFBSs can be improved by taking into account the positional dependencies of nucleotides (NPD) and the nucleotide-sequence-dependent structure of DNA. We make use of the random forest algorithm to exploit both of these types of information in a flexible way. The results we obtained for five eukaryotic TFs with different DNA-binding domains show that for these TFs both the structural method and the NPD method can be valuable for prediction of TFBSs. Furthermore, their predictive values seem to be complementary, even to the position weight matrix (PWM) method. This led us to combine all three methods. The resulting method yields an improved classification accuracy, and performs better than the most recent approach for most of the TFs we tested. The algorithm needs to be refined before it can be used for the prediction of prokaryotic TFs. Nevertheless, all obtained models can be of great use to gain insight into the binding modes of the different TFs.
M14 - Multiple reference genomes and transcriptomes for Arabidopsis thaliana
Short Abstract: Genetic differences between Arabidopsis thaliana accessions underlie the plant’s extensive phenotypic variation, and until now these have been interpreted largely in the context of the annotated reference accession Col-0. Here we report the sequencing, assembly and reannotation of the genomes of 18 natural A. thaliana accessions, and their transcriptomes.
