Accepted Posters

Attention Conference Presenters - please review the Speaker Information Page available here.

If you need assistance, please contact and provide your poster title or submission ID.

Category B - 'Comparative Genomics'

B01 - Context-specific Errors and Short Reads: Effects on Mapping and Variant Calling
  • Steven Vensko, North Carolina State University, United States

Short Abstract: Accurate short read mapping is essential for genuine locus-specific allele determination, which in turn allows for high-confidence variant calls. Miscalls resulting in reduced short read alignment are classically attributed to molecular and asymmetric biases but are now understood to also arise during the sequencing process itself. Here we explore various aspects of context-specific errors, a reproducible and systematic class of error contingent upon the sequence context preceding error-prone positions. We are specifically interested in quantifying how regions dense with context-specific errors impede proper short read mapping, and how this improper mapping affects the detection of proximal biological variants. We have utilized publicly available MiSeq-sourced short reads, which provide sufficient depth of coverage at most positions to empirically map the locations of context-specific errors. By mapping the MiSeq short reads under a wide spectrum of increasingly conservative parameter sets, we quantify the effects of context-specific errors on read mapping, which aids in gauging the probability that true biological variants remain undiscovered under common, more stringent mapping parameters.

B02 - ParaDIME (Parallel Differential Methylation Analysis): A statistical suite for genome-wide differential DNA methylation analysis
  • Sarabjot Pabla, Georgia Regents University, United States

Short Abstract: Aberrant DNA methylation is known to play an important role in the pathogenesis of several types of cancer. Determining the location of these methylation changes is essential for a clear understanding of the epigenetic landscape and its subsequent role in disease development and progression. Current technologies, such as reduced representation bisulfite sequencing (RRBS), allow comprehensive interrogation of DNA methylation on a genome-wide scale. However, statistical analysis of genome-wide methylation data is complex and resource intensive. To overcome these challenges we developed ParaDIME (Parallel Differential DNA Methylation), a parallel algorithm for differential methylation analysis of RRBS data. ParaDIME uses a non-parametric Rao-Scott chi-squared test that does not assume a normal distribution of methylation measurements on the genome, allowing for a more appropriate test for differential methylation. Moreover, the parallel architecture significantly increases the speed of analysis, and the ParaDIME framework is scalable to accommodate large volumes of RRBS data. We used ParaDIME to evaluate high- and low-risk subtypes of chronic lymphocytic leukemia (CLL) patients. ParaDIME analyzed 22 million data points for 11 patients in approximately 2 hours, making it significantly faster than current platforms, and identified 57,463 differentially methylated CpG sites in two subtypes of CLL patients. Downstream analyses of these sites revealed significant enrichment of cancer-related genes and pathways. ParaDIME can also be used to analyze differentially methylated regions such as CpG islands, promoters, enhancers and other regulatory elements. ParaDIME significantly improves the quality and speed of genome-wide DNA methylation analysis.
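The abstract does not spell out ParaDIME's exact test, but its uncorrected building block, a per-CpG-site 2×2 Pearson chi-squared on methylated vs. unmethylated read counts, can be sketched as below; the Rao-Scott variant additionally divides the statistic by an estimated design effect to account for correlated observations. This is an illustrative sketch, not ParaDIME's code.

```python
def site_chi2(meth_a, unmeth_a, meth_b, unmeth_b):
    """Pearson chi-squared statistic for a 2x2 table of methylated vs.
    unmethylated read counts at one CpG site in two patient groups."""
    row_a, row_b = meth_a + unmeth_a, meth_b + unmeth_b
    col_m, col_u = meth_a + meth_b, unmeth_a + unmeth_b
    n = row_a + row_b
    chi2 = 0.0
    for obs, r, c in [(meth_a, row_a, col_m), (unmeth_a, row_a, col_u),
                      (meth_b, row_b, col_m), (unmeth_b, row_b, col_u)]:
        expected = r * c / n
        chi2 += (obs - expected) ** 2 / expected
    return chi2
```

With identical methylation fractions in both groups the statistic is 0; it grows as the groups diverge.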

B03 - Accurate and Fast Identification of Genetic Relationship in Large Databases of Genotypes
  • Yumi Jin, National Institutes of Health, United States

Short Abstract: To assess data quality and integrity we have developed a fast method to identify duplicated participants and close relatives in data housed in the database of Genotypes and Phenotypes (dbGaP), which now contains genotypes for over 500,000 study participants across multiple genetic platforms. The method was designed to avoid the limitations of current relatedness analysis algorithms, which are practical for no more than a few thousand samples because their running time is at least quadratic. To implement the algorithm, we extracted genotypes for a carefully selected set of 10,000 biallelic Single Nucleotide Polymorphisms (SNPs) present on many different genotyping chips and platforms. Non-palindromic SNPs were selected so that the strand orientation could be easily determined by comparing the genotypes on different platforms. We also required that the SNPs be well separated, to minimize the effect of linkage disequilibrium, and have high minor allele frequencies, to maximize their informativeness. Our algorithm identifies duplicated subjects or monozygotic twins in O(n log n) running time while tolerating genotyping error rates of up to 1%; it takes only a few hours to find duplicated subject pairs among ~500,000 samples. We have also created a test statistic called the homozygous genotype mismatch rate (HGMR) to quickly identify parent/offspring pairs, full siblings and second-degree relatives. In this presentation I will describe the algorithms we have developed to find errors and missing relationships in the pedigree and other phenotypic data submitted to dbGaP.
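As an illustration of the HGMR statistic described above, a minimal computation might look like the following. The 0/1/2 minor-allele-count coding and the missing-data convention are assumptions for this sketch; the abstract does not give dbGaP's actual encoding.

```python
import numpy as np

def hgmr(g1, g2):
    """Homozygous genotype mismatch rate between two samples.

    Genotypes are coded as minor-allele counts (0, 1, 2). HGMR considers
    only the sites where both samples are homozygous (0 or 2) and returns
    the fraction of those sites with opposite homozygous calls.
    """
    g1, g2 = np.asarray(g1), np.asarray(g2)
    both_hom = ((g1 == 0) | (g1 == 2)) & ((g2 == 0) | (g2 == 2))
    if both_hom.sum() == 0:
        return 0.0
    mismatch = both_hom & (g1 != g2)
    return mismatch.sum() / both_hom.sum()
```

Duplicates and monozygotic twins give an HGMR near zero (genotyping error only), whereas unrelated pairs give a markedly higher rate, which is what makes the statistic a fast relatedness screen.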

B04 - A Pairwise Feature Selection Method for Gene Data Using Information Gain
  • Tian Gui, University of Mississippi, United States

Short Abstract: Current classification practice has limitations when applied to gene expression microarray data. For example, the robustness of the top scoring pair (TSP) method does not extend to some datasets with small sample sizes. Hence, it is necessary to construct a discriminative and stable classifier that generates highly informative gene sets. Not all features are active in a given biological process. Motivated by this observation, the Top Discriminating Pair (TDP) approach aims to reveal which features rank highly by discrimination power. To identify TDPs, each pair of genes is assigned a score based on their relative probability distribution. The widely used TSP algorithm owes its success on many microarray datasets involving human cancer to an effective feature selection method. Our experiment combines the TDP methodology with information gain (IG) to achieve a more effective feature set. To illustrate the effectiveness of TDP with IG, we applied the method to two breast cancer datasets. The results are competitive with the original TSP technique. Information gain combined with the TDP algorithm thus provides a new, effective method for feature selection in machine learning.
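The two ingredients named above, a TSP-style pair feature and information gain, can be sketched as follows. The exact TDP pair score is not given in the abstract, so this shows only the generic building blocks.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(class; feature) = H(class) - H(class | feature) for a discrete feature."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def pair_feature(expr_i, expr_j):
    """TSP-style binary feature: is gene i expressed below gene j in each sample?"""
    return [int(a < b) for a, b in zip(expr_i, expr_j)]
```

Ranking gene pairs by the information gain of their `pair_feature` against class labels is one plausible way to combine the two ideas; a perfectly class-separating feature attains IG equal to the class entropy.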

B05 - Mutational driver pathway collaboration in breast cancer
  • Zhenjun Hu, Boston University, United States

Short Abstract: An important challenge in understanding tumor mechanisms is to distinguish mutations that might drive cancer initiation and progression from the much larger number of bystanders that confer no relative survival advantage. However, because of the large number and diverse types of mutations, the frequency of any particular mutation pattern across a set of samples is low, which makes statistical distinctions and reproducibility across different populations difficult to establish.

We report a novel method to discover driver pathways not only by their enrichment of mutated genes but also by their invariant presence among the majority of samples. The basic idea is that although mutations are heterogeneous and vary among samples, the processes disrupted during cell transformation tend to be invariant across a population of a particular cancer or cancer subtype. In contrast to previous methods, we show that mutated pathway-groups can be found in each subtype after additional confirmation by their invariant presence among the majority of samples in one group but not in the other.

We apply our algorithm to breast cancer subtypes in two steps. The first identifies pathways that are significantly enriched in genes containing non-synonymous mutations; the second uses the pathways so identified to find groups that are functionally related in the largest number of samples. An application to 4 subtypes of breast cancer identified pathway-groups rich in processes associated with transformation. Each group is composed of pathways that are unambiguously attributable to, and highly explanatory of, a particular subtype. The algorithm will be developed as a VisANT plugin available at
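The abstract does not name the enrichment test used in the first step; a common choice for "pathways significantly enriched in mutated genes" is the hypergeometric tail probability, sketched here as an illustration rather than as the authors' method.

```python
from math import comb

def enrichment_p(total_genes, pathway_genes, mutated_genes, overlap):
    """P(X >= overlap): the chance of seeing at least `overlap` pathway
    members among the mutated genes if mutations were spread at random
    over all `total_genes` genes."""
    return sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, mutated_genes - k)
        for k in range(overlap, min(pathway_genes, mutated_genes) + 1)
    ) / comb(total_genes, mutated_genes)
```

A small p-value flags a pathway whose overlap with the mutated gene set is unlikely by chance; in practice one would also correct for testing many pathways.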

B06 - GeIST: a command-line pipeline for DNA integration mapping
  • Matthew LaFave, National Human Genome Research Institute, United States

Short Abstract: The integration of foreign DNA into target genomes is useful in many experimental settings; gene therapy, gene/enhancer traps, and retroviral mutagenesis all take advantage of this approach. In such studies, it is important to determine where the integrations take place. We previously developed a high-throughput method capable of identifying millions of integration sites, and now present the software designed to analyze the results of this method. The Genomic Integration Site Tracker (GeIST) is an automated command-line pipeline designed to interpret ligation-mediated PCR samples from Illumina MiSeq and HiSeq technologies. It identifies DNA integration sites and uses sequence barcodes to group them by sample. GeIST includes support for Tol2- and Ac/Ds-based vectors, as well as murine leukemia virus and adeno-associated virus. We optimized integration recovery for each of these four vectors by analyzing experimental data. GeIST is available at

B07 - Estimating the Performance of a Distributed Approach to Next-Generation Sequencing Alignment with the Burrows-Wheeler Aligner
  • Yu-Tai Wang, National Center for High-Performance Computing, Taiwan

Short Abstract: Background: Personal genome re-sequencing is becoming popular as its cost drops year by year, so a huge re-sequencing analysis workload can be expected. To understand the genome alignment computing capacity of Taiwan's infrastructure, we tested the Burrows-Wheeler Aligner on ALPS, ranked 243rd among the Top500 supercomputers in the world. The test helps us understand the computing requirements of the personal sequencing era.
Materials and Methods: The Burrows-Wheeler Aligner (BWA) can be run on cluster computers such as ALPS. In this estimate we used BWA 0.6.4 and the GRCh37 human reference genome, with a downloaded 1 GB FASTQ file as input. To simulate a heavy analysis load, we submitted 1,000 identical alignment jobs to ALPS and recorded the elapsed time, along with the elapsed time of a single alignment job. From this information we can easily calculate the required computing capacity.
Results: 32 nodes of ALPS finished the 1,000 alignment jobs within 17 hours and 10 minutes. Running the same jobs on a single node would take over 4 days. ALPS has 512 nodes; using all of them, the elapsed time would fall below one and a half hours.
Conclusion: There are 200,000 newborn babies every year in Taiwan. We expect that in future at least 70,000 parents per year will want to use re-sequencing technology to identify the genetic background of their children, which would require 2,184 computing hours on ALPS.
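The capacity arithmetic behind these figures can be reproduced as a back-of-the-envelope sketch. Note that the abstract's 2,184-hour total presumably assumes a larger per-genome data volume than the 1 GB benchmark input, so the final line here scales the benchmark only.

```python
# Benchmark: 1,000 identical 1 GB alignment jobs on 32 nodes in 17 h 10 min.
jobs, nodes_used = 1000, 32
elapsed_h = 17 + 10 / 60
node_hours_per_job = nodes_used * elapsed_h / jobs   # ~0.55 node-hours per 1 GB FASTQ

# Same 1,000 jobs spread over all 512 ALPS nodes: "below one and a half hours".
all_nodes = 512
hours_all_nodes = jobs * node_hours_per_job / all_nodes

# 70,000 re-sequencing runs per year, at the benchmark's 1 GB data volume.
hours_per_year = 70_000 * node_hours_per_job / all_nodes
```

At benchmark volume this comes to roughly 75 whole-machine hours per year; real re-sequencing inputs are far larger than 1 GB, which is consistent with the much higher figure quoted in the conclusion.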

B08 - BioGUI: A Graphical User Interface for Bioinformatics Applications
  • Francis Bell, Drexel University, United States

Short Abstract: There is currently a need for a commonly available graphical user interface for bioinformatics applications. Development of bioinformatics algorithms and programming libraries has received substantial effort from the research community; however, a common graphical frontend for these algorithms is still lacking. The goal of this project is to develop a common framework for the display of, and interaction with, common bioinformatics entities such as genes, proteins, species, and networks. The project provides a community-oriented system to meet this need. It is being developed in the Python programming language, since a number of libraries for biological data processing are already available, especially through the Biopython project. It supports the execution of available bioinformatics tools and services and the development of services for programmatic access by clients.

B09 - Another SVM-based predictor for RNA-binding protein prediction
  • Shiyong Liu, Huazhong University of Science and Technology, China

Short Abstract: RNA binding proteins (RBPs) play critical roles in various cellular processes, and computational approaches to identify RBPs play an important guiding role in designing experiments. In this paper, by integrating more protein features, we develop a support vector machine based computational method for predicting RNA-binding proteins, named RBPPR-Seq, which combines physicochemical properties, charge and evolutionary information from protein target sequences. Testing on a balanced dataset of 457 RNA binding proteins and 500 non-RNA binding proteins, we obtain an MCC of 0.79, an accuracy of 0.89 and an area under the ROC curve (AUC) of 0.96. Two other methods for RBP prediction from protein sequences, RNA-pred and SPOT-Seq, achieve MCCs of 0.34 and 0.51, accuracies of 0.67 and 0.74, and AUCs of 0.69 and 0.81 on the same testing dataset, respectively. Our method outperforms RNA-pred and SPOT-Seq and may be applied to predict RNA binding proteins on a genome scale.
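The reported metrics follow directly from a confusion matrix; a minimal sketch of the two scalar metrics (the authors' actual evaluation code is not shown in the abstract):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts;
    +1 is perfect prediction, 0 is chance level, -1 is total disagreement."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)
```

On a near-balanced set such as the one above (457 vs. 500), accuracy is informative, but MCC remains the more robust summary when classes are imbalanced.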

B10 - de novo mammalian assembly of one-library PCR-free 250-base Illumina reads
  • David Jaffe, Broad Institute, United States

Short Abstract: Comparison of mammalian genomes aids understanding of the human genome; however, currently available genomes are expensive to produce and limited in resolution. Existing approaches for generating high-quality mammalian assemblies require either the construction of multiple, labor-intensive libraries or the use of data types having a high per-base cost. To enable sequencing of hundreds of mammalian genomes, data costs need to be reduced and algorithms need to be developed to fully exploit these new data types.

We set out to test the assembly potential of a new, low-cost data type comprising 250-base, paired-end Illumina reads from a single PCR-free library. These data cost approximately $3,500 in reagents per library, are relatively unbiased with high contiguity, and can resolve perfect repeats of up to 500 bp.

To generate de novo assemblies using this new data type, we devised a new algorithm that assembles a 60x mammalian genome in approximately 24 hours on a 32-core server. Our algorithm greatly reduces the complexity of read error correction by first generating a draft assembly from uncorrected reads, and then using this initial assembly to guide targeted error correction. Our algorithm then builds a full, accurate assembly using the corrected data.

We demonstrate our method with assemblies for aardvark and white rhino, two mammals that have been previously studied using more expensive techniques. We assess each assembly in comparison to the reference genome for the species and demonstrate the power of this technique for comparative genomics through comparison of these assemblies to the human genome.

B11 - Accelerating RNA Secondary Structure Design Using Pre-Selected Sequences for Helices and Loops
  • Stanislav Bellaousov, University of Rochester, United States

Short Abstract: Nanoscale nucleic acids can be designed to be nano-machines, pharmaceuticals, or probes for detecting pathogens and other molecules. Nucleic acid secondary structures form the basis of self-assembling nanostructures. Because there are only four natural bases, it can be difficult to design sequences that fold to a single, specified structure. State-of-the-art sequence design methods use stochastic, iterative refinement to select sequences that fold to the specified input structure.
In this work, it is shown that natural RNA structures are composed of helices within an optimal folding free energy range, with high probability of folding to the helix and also with low ensemble defect of helix formation. To facilitate rapid design, a database of RNA helix sequences that demonstrate these features, and also have little tendency to cross-hybridize was built. Additionally, a database of RNA loop sequences with low helix formation propensity and little tendency to cross-hybridize with either the helices or other loops was assembled.
These pre-selected sequences accelerate the selection of sequences that fold with minimal ensemble defect. When using the database of pre-selected sequences as compared to randomly chosen sequences, sequences for biologically relevant structures are designed about 32 times faster, and random structures are designed about 6 times faster. The sequence database is part of the RNAstructure package and can be downloaded from

B12 - The influence of sequence and covalent modifications on yeast tRNA dynamics
  • Xiaoju Zhang, University of Rochester, United States

Short Abstract: Modified nucleotides are prevalent in tRNA. Experimental studies reveal that covalent modifications play an important role in tuning tRNA function. In this study, molecular dynamics (MD) simulations were used to investigate how modifications alter tRNA dynamics. The X-ray crystal structures of tRNAAsp, tRNAPhe, and tRNAiMet, both with and without modifications, were used as initial structures for 333 ns explicit solvent MD trajectories with AMBER. For each tRNA molecule, three independent trajectory calculations were performed. The global root mean square deviations (RMSD) of atomic positions show that modifications introduce significant rigidity only to the tRNAPhe global structure. Interestingly, RMSDs of the anticodon stem-loop (ASL) suggest that modified tRNA has a more rigid structure than unmodified tRNA in this domain. The anticodon RMSDs of the modified tRNAs, however, are higher than those of the corresponding unmodified tRNAs. These findings suggest that the rigidity of the anticodon stem-loop is finely tuned by modifications, where rigidity in the anticodon arm is essential for tRNA translocation in the ribosome, and flexibility of the anticodon is important for codon recognition. Principal component analysis (PCA) was used to examine correlated motions in tRNA. Additionally, covariance overlaps of PCAs were compared for trajectories of the same molecule and between trajectories of modified and unmodified tRNAs. The comparison suggests that modifications alter the correlated motions. For the anticodon bases, the extent of stacking was compared between modified and unmodified molecules, and only unmodified tRNAAsp has a significantly higher percentage of stacking time.
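For reference, the global RMSD measure used throughout the abstract reduces, for coordinate sets that have already been superposed, to the following (the superposition/alignment step is deliberately omitted from this sketch):

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two (n_atoms, 3) coordinate arrays,
    assuming the structures have already been superposed."""
    diff = np.asarray(coords_a, float) - np.asarray(coords_b, float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

In practice an optimal rigid-body superposition (e.g. the Kabsch algorithm) is applied first, so that the RMSD reflects internal motion rather than overall rotation or translation.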

B13 - Tools and Data Services Registry
  • Jon Ison, EMBL-EBI,

Short Abstract: The Tools and Data Services Registry facilitates discovery of tools and data services across the spectrum. It includes diverse types of application software, including analytical tools and data resources. It spans all of biomedical research from molecules to systems biology and personalised medicine. Content is collated from the many existing software catalogues and collections and served in a highly streamlined, customisable and intuitive user interface. Software is easy to understand if it is described in a simple and consistent way: the registry resources are summarised in terms from controlled vocabularies. The registry helps to save time finding the right tools for the job, avoid duplication of coding efforts and coordinate scientific & technical development.

B14 - Model for Prediction of Change in Folding Rates of Two-State Proteins Upon Point Mutations
  • M. Michael Gromiha, Indian Institute of Technology Madras, India

Short Abstract: Single-domain proteins that fold without any detectable intermediates, crossing a single rate-limiting barrier, are termed 'two-state folders'. The transition state structures that determine the folding mechanism are highly unstable and cannot be observed directly. Hence, folding mechanisms are deciphered by studying the effect of single-site point mutations on the folding rate constants. Identifying simple rules that connect protein sequence/structure based properties to the changes in experimental observables would be immensely helpful in designing mutations to probe for specific mechanisms or to understand aggregation propensity.
We have developed regression models to discern the relationship between the change in folding and unfolding rates of proteins upon point mutations and properties associated with amino acids. 790 point mutants belonging to 23 proteins known to fold by the two-state principle were obtained from the literature. From a total set of 593 amino acid descriptors, a top tier of 103 prime features was shortlisted using an attribute selection approach from machine learning. Classifying the data by secondary structure (helix, strand, coil), accessible surface area (0-12%, 12-36%, >36%) and sequential position (N-terminal, middle, C-terminal region) exhibited the best performance. For each class, the top three covariates and an optimized window length were selected via a multiple linear regression approach. Jackknife cross-validation produced a mean correlation of 0.73, a mean absolute error of 0.42 and an accuracy of 80.58% for folding rates. Our method performs twice as well as the only other available prediction method. The significance of the outliers will be discussed.
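The jackknife (leave-one-out) validation of a multiple linear regression can be sketched as below. The actual descriptors, window lengths and class splits of the method are omitted; this shows only the validation loop itself.

```python
import numpy as np

def jackknife_correlation(X, y):
    """Leave-one-out ordinary-least-squares predictions, then the Pearson
    correlation between predicted and observed values."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                  # drop sample i from the fit
        design = np.c_[np.ones(n - 1), X[mask]]   # intercept + covariates
        beta, *_ = np.linalg.lstsq(design, y[mask], rcond=None)
        preds[i] = np.r_[1.0, X[i]] @ beta        # predict the held-out sample
    return float(np.corrcoef(preds, y)[0, 1])
```

Because each prediction is made by a model that never saw the sample being predicted, the resulting correlation is an honest estimate of out-of-sample performance, unlike a correlation computed on the training fit.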

B15 - Identification of effector binding sites on H-Ras explains signal propagation
  • Serena Muratcioglu, Koc University, Turkey

Short Abstract: Ras proteins (HRAS, NRAS and KRAS) are small GTPases that regulate diverse cellular processes. These proteins activate multiple signaling pathways with complex and divergent effects, including cell cycle progression, cell differentiation and survival. Ras proteins cycle between two conformations: GDP-bound inactive and GTP-bound active forms. Active Ras proteins transmit information through physical interactions with their downstream effector proteins. It is therefore of central importance to determine the complex structures of Ras with these proteins in order to understand the pathways at the structural level. Here we show that GTP-bound H-Ras interacts with its downstream effector proteins through different interfaces. These interface regions include the Switch I effector binding site and the allosteric site. The predominant interface region consists of α1, β2, β3, β4 and β5. The effector proteins that bind to these regions are Raf-1, B-raf, PI3Kγ, PLCε, and Byr2 (crystal structures of the complexes are available), Cdc42, FTase (the complexes are predicted) and RASSF1 (structure modeled and interface predicted). The second interface region, populated by RAIN, RGS12, RGL1 (structure modeled and interface predicted) and TIAM proteins (the complexes are predicted), includes α2, α3, β7, α4, β10 and α5. A few effector proteins, such as AFAD, RIN1 and FAK, bind to H-Ras through an interface region that partially overlaps both binding sites. Here we also identify mutually inclusive/exclusive interactions by predicting and comparing the interface regions of H-Ras with its partners. This may help us identify the pathways that can be activated simultaneously by active Ras proteins.

B16 - Data Cleaning in Long Time-series Plant Photosynthesis Phenotyping Data
  • Jin Chen, Michigan State University, United States

Short Abstract: The scale of plant photosynthesis phenotyping experiments is growing exponentially, and they have become a first-class asset for understanding the mechanisms affecting energy intake in plants, essential for improving crop productivity and biomass. However, the quality of phenotyping data is compromised by sources of noise, including systematic errors, unbiased noise and abnormal patterns, which are difficult to remove at the data collection step. Given the value of clean data for any downstream operation, the ability to improve data quality is a key requirement.

Data cleaning is a classical computational problem, traditionally addressed with ad-hoc tools built from low-level rules, manually tuned algorithms designed for specific tasks, and statistical methods applied to relational databases. However, removing impurities from long time-series phenotyping data requires handling a high temporal dimension and separating biological discoveries (geometric or trajectory outliers) from abnormalities, which has not been extensively discussed in the literature.

We develop a novel framework to effectively identify abnormalities in phenotyping data. Specifically, our model employs an EM process to repeatedly classify the data into two classes: abnormality and non-abnormality. In each iteration, it uses the non-abnormality class to generate photosynthesis-irradiance curves at different granularities using Michaelis-Menten kinetics, the best-known biochemical model of enzyme kinetics, and reassigns the class membership of every value based on its fit to the curves. Experimental results on both real and synthetic datasets show that our algorithm identifies most of the abnormalities while leaving the biological discoveries intact, performing significantly better than existing goodness-of-fit models.
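A toy version of the iterative classify-and-refit loop might look like the following. The grid-search fit and the fixed residual threshold are stand-ins for whatever optimizer and membership rule the authors actually use; this sketch only illustrates the alternation between curve fitting on the "normal" class and reassignment of class membership.

```python
import numpy as np
from itertools import product

def michaelis_menten(irradiance, vmax, km):
    """Photosynthesis rate as a Michaelis-Menten function of irradiance."""
    return vmax * irradiance / (km + irradiance)

def clean(irradiance, photo, thresh=1.0, iters=5):
    """Iteratively fit a Michaelis-Menten curve to the currently 'normal'
    points and move any point whose residual exceeds `thresh` to the
    abnormal class. Returns the final normal-point mask and fit parameters."""
    vmax_grid = np.arange(1.0, 20.5, 0.5)
    km_grid = np.arange(0.5, 5.1, 0.25)
    normal = np.ones(len(irradiance), dtype=bool)
    best = None
    for _ in range(iters):
        best = min(
            product(vmax_grid, km_grid),
            key=lambda p: ((photo[normal]
                            - michaelis_menten(irradiance[normal], *p)) ** 2).sum(),
        )
        residual = np.abs(photo - michaelis_menten(irradiance, *best))
        normal = residual < thresh
    return normal, best
```

Planting one spike in otherwise clean Michaelis-Menten data, the loop converges in a couple of iterations: the spike is classified abnormal and the fit snaps back to the generating parameters.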

B17 - NextGenScores: Fast Alignment-based Estimation of Intragenomic Sequence Similarities
  • Philipp Rescheneder, University of Vienna, Medical University of Vienna, Austria

Short Abstract: A central step of high-throughput sequencing (HTS) analysis is mapping short reads to a reference genome. The accuracy of this mapping step, however, is heavily influenced by sequence similarities within a genome. To accurately identify regions of poor mapping accuracy, we have developed NextGenScore, a framework based on our highly-efficient alignment software NextGenMap.
NextGenScore partitions a genome into synthetic, overlapping reads and maps them back to the source genome using Smith-Waterman (SW) alignments. By averaging the relative difference between best and second best alignment scores for all reads overlapping a particular genomic position, we calculate an unbiased score that represents the maximum similarity of a genomic region to any other region in the genome. Based on this score, we assessed the influence of intragenomic sequence similarities on the read mapping accuracy of several state-of-the-art read mapping programs. By comparing our score to previous work, we show the benefits of using SW-alignments, which are not biased by an explicit upper limit for the number of allowed mismatches, insertions or deletions, for assessing mappability and intragenomic sequence similarities. Furthermore, we show how our pre-computed score can complement mapping quality in identifying ambiguously mapped reads.
NextGenScore outputs a similarity track that can be used to visually inspect regions of interest in state-of-the-art genome browsers. Additionally, it provides command-line tools to automatically post-process read alignments by filtering out spuriously mapped reads and determining mappable yet uncovered genomic regions. NextGenScore thereby serves as a valuable resource for quality assurance that can easily be integrated into existing HTS pipelines.
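The per-position score described above reduces, per read, to a normalized margin between the best and second-best alignment scores, averaged over reads covering a position. A hedged sketch (function names are illustrative; this is not NextGenScore's actual API):

```python
def read_uniqueness(best_score, second_best_score):
    """Relative difference between best and second-best alignment scores.
    1.0 means no other region aligns the read comparably; 0.0 is a tie."""
    if best_score <= 0:
        return 0.0
    return (best_score - second_best_score) / best_score

def position_score(read_scores):
    """Average read uniqueness over all (best, second-best) score pairs
    for reads overlapping one genomic position."""
    return sum(read_uniqueness(b, s) for b, s in read_scores) / len(read_scores)
```

A position covered only by uniquely mapping reads scores near 1, while a position inside a near-perfect repeat, where second-best scores approach the best scores, falls toward 0.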

B18 - High-Performance De Novo RNA-Transcript Assembly Leveraging Distributed Memory and Massive Parallelization
  • Pierre Carrier, Cray, Inc., United States

Short Abstract: Exemplifying collaborative software development between industry and academia to tackle the computational challenges of manipulating large volumes of next-gen sequence data, and leveraging advances in algorithm development and compute hardware, we describe our efforts to optimize the performance of the Trinity RNA-Seq de novo assembly software. We compare three versions of Inchworm, Trinity's computationally intensive component: the original OpenMP version and two new versions based on MPI and Fortran 2008. New results on Inchworm's parallel performance for various real-life problems (e.g., mouse, Schizosaccharomyces pombe) are presented, as well as a detailed discussion of the MPI and PGAS computation schemes for Inchworm.

B19 - Variant detection model with improved robustness and accuracy for low-depth targeted next-generation sequencing data
  • Patrick Flaherty, Worcester Polytechnic Institute, United States

Short Abstract: We present a Bayesian sensitivity analysis of our model to variations in the prior function. We show that a Jeffreys prior gives a lower false discovery rate (FDR) for detecting a 0.1% minor allele frequency event compared with an improper prior.
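For a binomial count of variant-supporting reads, the Jeffreys prior is Beta(1/2, 1/2); a minimal sketch of the resulting conjugate posterior for the minor allele frequency follows. The full model in the abstract is richer than this, so treat it as an illustration of the prior choice only.

```python
def jeffreys_posterior(k, n):
    """Posterior Beta(alpha, beta) parameters and posterior mean for the
    minor allele frequency after observing k variant reads out of n,
    under a Jeffreys Beta(1/2, 1/2) prior."""
    alpha, beta = k + 0.5, n - k + 0.5
    return alpha, beta, alpha / (alpha + beta)

# e.g. 3 variant reads out of 2,000 at a putative 0.1% MAF site
a, b, mean = jeffreys_posterior(3, 2000)
```

Unlike an improper flat prior on the log scale, the Jeffreys prior stays proper at 0 coverage and shrinks low-count frequency estimates only mildly, which is one intuition for its lower FDR at rare-variant thresholds.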

B20 - GLAD: A mixed-membership model for heterogeneous tumor subtype classification
  • Patrick Flaherty, Worcester Polytechnic Institute, United States

Short Abstract: We have developed a mixed-membership classification model, called GLAD, that simultaneously learns a sparse biomarker signature for each subtype as well as a distribution over subtypes for each sample. We demonstrate the accuracy of this model on simulated data, in-vitro mixture experiments, and clinical samples from the Cancer Genome Atlas (TCGA) project.

B21 - Computational modeling of complex biochemical structures assembly and evolution
  • Adriana Compagnoni, Stevens Institute of Technology, United States

Short Abstract: This work defines BioScape-L, a new agent based modeling language for the stochastic simulation of complex systems in 3D space. In order to model the assembly of configurations of polymers, oligomers, and complexes such as microtubules or actin filaments, existing modeling approaches require the programmer to deal with the low-level details of collision, confinement, positioning, and diffusion. The motivation for BioScape-L comes from the need to describe the evolution, assembly and polymerization of complex structures of biochemical species in space, while keeping a sufficiently high-level description so that tedious and error-prone low-level details are hidden from the programmer. The proposed solution is to allow the programmer to describe the relative positioning of entities through programmable locations, while keeping collision, confinement and diffusion as part of the simulation engine rather than the model created by the programmer.

Further new aspects of BioScape-L include random translation and scaling. Random translation is instrumental in describing the location of new entities relative to the old ones. For example, when a cell secretes a hydronium ion, the ion should be placed at a given distance from the originating cell, but in a random direction. Additionally, scaling allows us to capture at a high level events such as cell division and growth; for example, daughter cells after mitosis have approximately half the size of the mother cell. The benefits of the new features are illustrated with several examples, including cytoskeletal microtubules polymerization, cell division, and hydronium ion secretion.

B22 - Modified Amber Force Field Correctly Models the Conformational Preference of Tandem GA pairs in RNA
  • Asaminew Aytenfisu, University of Rochester, United States

Short Abstract: Conformational changes are important for RNA function. We used molecular mechanics with all-atom models to understand the conformational preference of tandem guanine-adenine (GA) base pairs in RNA. These tandem GA base pairs play important roles in determining the stability and structural dynamics of RNA tertiary structures. Previous solution structures showed that tandem GA base pairs adopt either imino (cis-Watson-Crick/cis-Watson-Crick interaction) or sheared (trans-Hoogsteen/trans-Hoogsteen interaction) pairing depending on the sequence and orientation of the adjacent base pairs. In our simulations we modeled (GCGGACGC)2 (Wu and Turner, 1996) and (GCGGAUGC)2 (Tolbert et al., 2007), whose experimentally preferred conformations are imino and sheared, respectively. Besides the experimentally preferred conformations, we constructed models of the non-native conformations by changing cytosine to uracil or uracil to cytosine. We used explicit solvent molecular dynamics and free energy calculations with umbrella sampling to measure the free energy difference between the experimentally preferred conformation and the non-native conformations. A modification to ff10 was required that allows the guanine amino group to leave the base plane (Yildirim et al., 2009). With this modification, the RMSDs of unrestrained simulations and the free energy surfaces are improved, suggesting the importance of electrostatic interactions by G amino groups in stabilizing the native structures.
1) Tolbert, B. S.; Kennedy, S. D.; Schroeder, S. J.; Krugh, T. R.; Turner, D. H. Biochemistry 2007, 46, 1511–1522.
2) Wu, M.; Turner, D. H. Biochemistry 1996, 35, 9677.
3) Yildirim, I.; Stern, H. A.; Sponer, J.; Spackova, N.; Turner, D. H. J. Chem. Theory Comput. 2009, 5, 2088–2100.

B23 - Pathway based models have similar predictivity and robustness to models based on random collections of genes
  • Marcelo Segura, Imperial College London, United Kingdom

Short Abstract: In the analysis of gene expression data, it is common to explore effects in terms of groups of genes or pathways. It is often assumed that these pathway analyses should be more robust to noise in the data than those based on individual genes. We have investigated the robustness of supervised pathway models in terms of their ability to predict an outcome in the face of increasingly noisy data. The initial gene expression profiles were transformed into a new "pathway space" in which each sample is represented by its scores across a large number of known pathways. The robustness of predictive models in pathway space and in the original gene space was investigated through simulated degradation of the expression data. The results show that models in pathway space had similar or better robustness compared to models in the original gene space. Surprisingly, the success of pathway-space models does not rely on the specific definition of the pathways, since randomised pathways produce models of similar predictive accuracy and robustness to those based on the true definitions. Nonetheless, we found that the contribution of the different pathways to the predictive models in the true pathway set follows a very different distribution to that in the randomised pathway sets. Overall, this work indicates that predictive ability and robustness should not be the only criteria on which to evaluate the utility of pathway-based predictive models in the omics sciences.
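The "pathway space" transformation and the randomized-pathway control can be sketched as follows. This is a minimal illustration under the assumption that a pathway's score for a sample is the mean z-score of its member genes; the poster does not specify its actual scoring method, and all names here are invented.

```python
import random
import statistics

def pathway_scores(expr, pathways):
    """Project one sample's per-gene expression (gene -> value) into
    'pathway space': each pathway's score is the mean z-score of its
    member genes (assumed scoring scheme, for illustration only)."""
    genes = list(expr)
    mu = statistics.fmean(expr[g] for g in genes)
    sd = statistics.stdev(expr[g] for g in genes)
    z = {g: (expr[g] - mu) / sd for g in genes}
    return {name: statistics.fmean(z[g] for g in members if g in z)
            for name, members in pathways.items()}

def randomized_pathways(pathways, gene_pool, seed=0):
    """Size-matched random gene sets, used to test whether the pathway
    *definitions* (and not just the grouping) drive predictive power."""
    rng = random.Random(seed)
    return {name: rng.sample(gene_pool, len(members))
            for name, members in pathways.items()}
```

Training the same classifier on true and randomized pathway scores, and degrading the input data, reproduces the comparison the abstract describes.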

B24 - A computational framework for microRNA transcription start site identification by integrating high-throughput techniques.
  • George Georgakilas, University of Thessaly, Greece

Short Abstract: microRNA (miRNA) promoters and primary transcripts (pri-miRNAs) are still considered "elusive", and the characterization of miRNA expression regulation remains an open problem. Following transcription, pri-miRNAs are rapidly cleaved by the Drosha enzyme, inhibiting their characterization with standard techniques. We have developed a versatile computational framework integrating Next Generation Sequencing (NGS) data and Support Vector Machines (SVMs) for the identification of miRNA transcription start sites (TSSs). By applying the algorithm to deeply sequenced NGS data from mouse embryonic stem cells (mESCs), we have identified 70 intergenic pri-miRNA TSSs. The algorithm was validated using mESCs from Drosha null/conditional-null mice, which lack the Drosha enzyme, enabling pri-miRNA identification by RNA deep sequencing. Approximately 80% of the experimentally verified TSSs lie within 250 base pairs of the predictions, surpassing every other available computational method. When tested against protein-coding genes, even with a loose threshold the algorithm achieves 99.5% accuracy, 98.2% precision, 99.5% specificity and 99.7% sensitivity.
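The four reported performance figures follow from a binary confusion matrix in the standard way. A minimal helper (illustrative; not the authors' evaluation code):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, specificity and sensitivity from the
    counts of true/false positives and negatives."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
    }
```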

B25 - Assessing and Improving In-House Genome and Transcriptome Assembly Solutions: A Case Study
  • Jeffrey Cullis, Agriculture and Agri-Food Canada, Canada

Short Abstract: At AAFC, we have developed systems and tools to perform computationally intensive analyses, such as genome and transcriptome assembly, in an automated, reproducible fashion for researchers working on many target organisms within the organization.

The bioinformatics landscape can change quickly, and incorporation of new tools and best practices is necessary in order to be sure of providing outputs of the highest possible quality. At the same time, there is a lot of useful and often hard-won knowledge embedded in existing systems that should be preserved where it is most beneficial. Here we present an assessment of the advantages and disadvantages of our current system, and propose new directions for development moving forward.

Key aspects of the system to be preserved include the central design goal of a release structure whereby a metadata file, capturing all sequencing metadata, processing steps, and input and assembly quality statistics, is provided along with the final processed outputs. Other useful features include optimization for our specific cluster computing environment, and full automation of all steps, from raw data acquisition through assembly quality assessment and comparison to release generation.

A key outcome of our assessment is that greater integration with the Galaxy suite will be necessary moving forward. This would confer a number of advantages, including allowing others to modify and run the pipeline without using the CLI, simplifying and possibly eliminating development overhead for metadata recording, and a more robust ability to extend and interchange tools used in assembly.

B26 - Hierarchical nonparametric association discovery in high-dimensional data with high-dimensional metadata
  • Joseph Moon, Harvard School of Public Health, United States

Short Abstract: Recent advances in biomedical technologies, statistical theory, and computing have together expanded both the availability of and the capacity to analyze extremely high-dimensional data. However, detecting patterns in data with millions of feature variables in an efficient and well-powered manner remains an open and difficult problem, particularly when assessing heterogeneous data (e.g. phenotypes of different types and units) and when integrating multiple data types (e.g. genetics, gene expression, epigenetics, etc.). Univariate methods lack statistical power when applied naively, while multivariate methods are either computationally expensive or provide qualitative rather than quantitative association results. We have thus developed HAllA, a Hierarchical All-against-All method for discovery of significant relationships among heterogeneous, high-dimensional data features. HAllA uses nonparametric normalized mutual information as its univariate test, combined with a cross-hierarchical clustering scheme within or between datasets to find associations of high confidence. It thus detects nonlinear relationships with high sensitivity among both continuous and categorical values, comparing favorably with recent univariate methods such as MIC. HAllA also maintains a better aggregate combination of power and false discovery rate than fully univariate or multivariate approaches. An implementation of the method is publicly available as an efficient, open-source, end-to-end software package.
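The normalized mutual information that serves as HAllA's univariate test can be sketched for discrete (e.g. binned) variables as I(X;Y) / sqrt(H(X)·H(Y)). A stdlib-only illustration of that formula (not the HAllA implementation, which also handles continuous data via discretization):

```python
import math
from collections import Counter

def normalized_mutual_information(xs, ys):
    """NMI between two discrete variables, in [0, 1]:
    I(X;Y) / sqrt(H(X) * H(Y)), entropies in nats."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mi = sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), c in pxy.items())
    hx, hy = entropy(px), entropy(py)
    return mi / math.sqrt(hx * hy) if hx and hy else 0.0
```

Identical variables score 1, independent ones score 0, and nonlinear but deterministic relationships still score highly, which is the property the abstract highlights.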

B27 - Discovering co-variation and co-exclusion patterns in compositional data from the human microbiome
  • Curtis Huttenhower, Harvard School of Public Health, United States

Short Abstract: Background: Compositional data, or data constrained to sum to a constant total, occur in many scientific areas. The non-independence of such data causes spurious correlations when standard covariance measures are applied, regardless of the similarity measure used. This problem has not yet been addressed in a way that generalizes to different similarity measures, nor for the high-dimensional measurements typical of modern biological data, including data from microbial community studies.
Results: We developed an approach to provide appropriate p-values for varied similarity scores between compositional measurements, which we call Compositionality Corrected by PErmutation and REnormalization (CCREPE). We assessed the false positive rate of CCREPE using synthetic datasets modeling a variety of realistic community structures, and compared its performance and behavior with those of existing methods. We observed that CCREPE performs better in communities with greater evenness than in more skewed communities. We further applied the CCREPE procedure, using a novel ecologically targeted similarity score (the N-dimensional Checkerboard score), to 682 metagenomes from the Human Microbiome Project to determine significant co-variation patterns while avoiding spurious correlation from compositionality. Overall, the resulting network recapitulated the basic characteristics of earlier 16S-based networks, including little (<15%) between-site interaction and few "hub" microbes (scale-freeness).
Conclusions: These new methods will allow the derivation of significant co-variation networks from high-dimensional compositional data, particularly the detection of species and, eventually, sub-species level ecological interactions within microbial communities.
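The permutation/renormalization idea at the core of CCREPE can be sketched as follows. This is a toy illustration, not the CCREPE implementation: Pearson correlation and the two-sided counting rule are illustrative choices, and the real method supports arbitrary similarity scores.

```python
import random

def renormalize(counts):
    """Convert a per-sample count vector to relative abundances."""
    total = sum(counts)
    return [c / total for c in counts]

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def permutation_renormalization_pvalue(samples, i, j, n_perm=1000, seed=0):
    """Two-sided p-value for the similarity of features i and j across
    compositional samples. Each permutation shuffles feature i across
    samples, renormalizes every sample, and recomputes the score, so the
    null distribution retains the compositional constraint."""
    rng = random.Random(seed)
    rel = [renormalize(s) for s in samples]
    obs = pearson([s[i] for s in rel], [s[j] for s in rel])
    col_i = [s[i] for s in samples]
    hits = 0
    for _ in range(n_perm):
        perm = col_i[:]
        rng.shuffle(perm)
        shuffled = [s[:i] + [p] + s[i + 1:] for s, p in zip(samples, perm)]
        rel_p = [renormalize(s) for s in shuffled]
        r = pearson([s[i] for s in rel_p], [s[j] for s in rel_p])
        if abs(r) >= abs(obs):
            hits += 1
    return hits / n_perm
```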

B28 - Hairpin and multiloop constraints in predicting RNA secondary structure
  • Peter Clote, Boston College, United States

Short Abstract: Ab initio RNA secondary structure prediction using dynamic programming, with free energy parameters obtained from UV absorbance experiments, has proven important to molecular biology, with applications ranging from the detection of microRNA targets, noncoding RNAs, and virulence switches in microbial genes to the design of novel RNA molecules. Despite the success of programs such as MFOLD, the accuracy of secondary structure prediction can be improved. To that end, we describe four novel algorithms: RNAhairpin, RNAmloopNum, RNAmloopOrder and RNAmloopHP. Given an RNA sequence, for each integer x, the first three compute the partition function Z(x), respectively, over all structures having exactly (1) x hairpins, (2) x multiloops, or (3) multiloop order x. Given x and y, the algorithm RNAmloopHP computes the partition function Z(x,y) taken over all structures having simultaneously x hairpins and y multiloops. Using the partition function, our algorithms then sample structures from the corresponding low energy ensemble. If one knows aspects of the biologically functional structure, e.g. that transfer RNAs generally have 3 hairpins, then application of our structurally constrained algorithms leads to an improvement in accuracy.

Additionally, by using the FFT, we improve runtime by an order of magnitude. Applications: (1) For many RNA families, structure prediction is improved by RNAmloopHP; for instance, sensitivity improves by almost 24% for transfer RNA, while for certain ribozyme families there is an improvement of around 5%. (2) Probabilities p(k) of forming k hairpins [resp. multiloops] provide novel, discriminating features for SVM and RVM classifiers for RNA families. Our programs, written in C/C++, are publicly available.
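The FFT speedup rests on a standard polynomial trick: the values Z(x) for all x are the coefficients of a generating polynomial, which can be recovered from evaluations of the partition function at roots of unity by a single inverse DFT. A toy illustration with invented coefficients, assuming this interpolation scheme is what underlies the speedup (the algorithms' exact formulation may differ):

```python
import numpy as np

# Toy hairpin-count decomposition: coeffs[k] plays the role of Z(k),
# the partition function restricted to structures with exactly k hairpins.
coeffs = np.array([1.0, 3.0, 2.0, 0.5])
m = len(coeffs)

# Evaluate the generating polynomial P(w) = sum_k coeffs[k] * w**k at the
# m-th roots of unity (in the real algorithm, one DP run per evaluation point).
roots = np.exp(2j * np.pi * np.arange(m) / m)
values = np.array([np.sum(coeffs * r ** np.arange(m)) for r in roots])

# One FFT recovers all m coefficients from the m evaluations at once.
recovered = np.fft.fft(values) / m
```

Since `values[j] = Σ_k coeffs[k]·ω^(jk)`, the forward FFT (which uses the conjugate root) divided by m inverts the evaluation exactly.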

B29 - Measure of protein specificity similarity arising from dynamical behavior to engineer specificity switching
  • Ricardo Corral, Universidad Nacional Autónoma de México, Mexico

Short Abstract: The relation between the structure and activity of proteins is an intriguing puzzle that remains unsolved.
For instance, two enzymes may have very similar folds, as measured by RMSD on a structural alignment, and yet be very selective with their substrates.

To explain these differences, we define the common structural region (CSR) between two proteins. Since this region is shared by both proteins, it cannot by itself explain the difference in specificity. We therefore hypothesize that the residues that are not structurally conserved between the two proteins shift the CSR conformational ensemble and thereby determine their specificities.

We implemented a method that measures CSR dynamics in order to estimate specificity similarity. The accessible conformations of a protein structure are sampled by normal mode analysis. Then, for each conformation in this ensemble, an ordered list of residues is obtained from graph-theoretic centrality values, as these values are known to be related to functional importance.

These lists of residue orderings represent the protein's functional conformations; hence, for proteins with similar function, the list sets should be similar.

To test our procedure, we used the previously reported structure of a periplasmic binding protein with specificity for putrescine and the structure of a mutant of this protein with specificity for spermidine (Schieb et al., 2013). We observed that our similarity score correlates with their specificity, while the RMSD score does not.

This specificity similarity estimation can be used to discriminate between mutant models when shifting specificity profiles toward a target ligand preference.

B30 - DIDA: Distributed Indexing Dispatched Alignment
  • Hamid Mohamadi, Canada's Michael Smith Genome Sciences Centre, Canada

Short Abstract: One of the most essential applications in bioinformatics affected by the High-Throughput Sequencing (HTS) data deluge is sequence alignment, in which nucleotide or amino acid sequences are queried against targets to find regions of close similarity. When queries are too many and/or targets are too large, alignment becomes computationally challenging. This is especially true when targets are dynamic, such as the intermediate steps of a de novo assembly process.

To address this problem, we have designed DIDA, a distributed and parallel indexing and alignment algorithm. First, we partition the targets into smaller parts using a heuristic balanced cut. Next, we create an index for each partition. The reads are then “flowed” through a Bloom filter to dispatch the alignment task to the corresponding node(s). Finally, the reads are aligned on all partitions and the results are combined together to create the final output.

We demonstrate the performance of DIDA when coupled with BWA and Bowtie2 on human chromosome 14 and the human genome on four nodes. Compared to their baseline performance, when run through the DIDA protocol, BWA and Bowtie2 use less memory (by 75% for both) and execute faster (by 30% and 40%, respectively) for chromosome 14. When tested on a draft human genome assembly, although the improvements in memory performance stay the same, both aligners display better runtime gains (35% and 56%, respectively). DIDA is expected to have broad uptake in many bioinformatics applications, including large-scale alignments to draft genomes and intermediate stages of de novo assembly runs.
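The Bloom-filter dispatch step can be sketched as follows. This is a minimal illustration of the idea, not DIDA's implementation: each partition's index is summarized by a Bloom filter of its k-mers, and a read is sent only to partitions whose filter matches one of the read's k-mers (the k-mer length and hashing scheme here are illustrative assumptions).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.
    Membership tests may yield false positives but never false negatives,
    so a read is never withheld from a partition that truly needs it."""
    def __init__(self, m=1 << 16, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _hashes(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h // 8] |= 1 << (h % 8)

    def __contains__(self, item):
        return all(self.bits[h // 8] & (1 << (h % 8))
                   for h in self._hashes(item))

def dispatch(read, partition_filters, kmer=4):
    """Indices of target partitions whose filter contains any k-mer of
    the read; only those nodes receive the alignment task."""
    kmers = {read[i:i + kmer] for i in range(len(read) - kmer + 1)}
    return [idx for idx, bf in enumerate(partition_filters)
            if any(km in bf for km in kmers)]
```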

B31 - Correction of Expression Irregularity in RNA-Seq
  • Ehsan Tabari, University of North Carolina at Charlotte, United States

Short Abstract: High-throughput sequencing of RNA (RNA-Seq) provides unprecedented insight into transcriptome complexity. It has replaced earlier methods of measuring gene expression, is widely used to investigate non-coding RNA, and plays a major role in revealing tissue- and condition-specific alternative splicing in eukaryotes and alternative operons in prokaryotes. Most existing RNA-Seq analysis pipelines assume that RNA reads are uniformly distributed along a transcribed region. However, recent work has demonstrated that this assumption does not hold, since a variety of sources introduce bias in read distribution across sequencing protocols and species. Local GC content; cleavage, priming and adapter-ligation preferences; and possible RNA secondary structures are all potential causes of such bias. It has been shown that these biases drastically affect the transcriptome landscape, and that correcting for them produces better expression-level correlation between replicate experiments. However, only a few methods have been introduced to address this issue, among which Cufflinks, mseq and Genominator are noteworthy. Here, we introduce a new computational model that detects and corrects the biases introduced at each experimental step independently. We show that this multistep model outperforms existing approaches and improves downstream RNA-Seq analysis.
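One simple form of non-uniformity correction can be sketched as follows: estimate a per-position bias profile as observed coverage over its expected (uniform) value, then divide it out. This is a minimal illustration of the general reweighting idea, not the poster's multistep model; function names and the averaging scheme are invented.

```python
from statistics import fmean

def positional_bias_weights(coverage_profiles):
    """Per-relative-position bias profile: mean coverage at each
    position across transcripts, divided by the overall mean.
    Under the uniform-read assumption every weight would be ~1."""
    n_pos = len(coverage_profiles[0])
    pos_mean = [fmean(p[i] for p in coverage_profiles) for i in range(n_pos)]
    overall = fmean(pos_mean)
    return [m / overall for m in pos_mean]

def correct_profile(profile, weights):
    """Divide observed coverage by the estimated bias weights."""
    return [c / w if w else 0.0 for c, w in zip(profile, weights)]
```

Applied to transcripts that all over-represent their middle positions, the weights exceed 1 there and the corrected profiles flatten out.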

B33 - Efficient simulation of exact and approximate coalescent with selection
  • Ilya Shlyakhter, Broad Institute, United States

Short Abstract: Efficient simulation of population genetic samples from a given demographic model is a prerequisite for many analyses. Coalescent theory provides an efficient framework for such simulations, but simulating longer regions and higher recombination rates remains challenging. Simulators based on a Markovian approximation to the coalescent scale well, but do not support simulation of selection.

We describe cosi2, an efficient simulator that supports both exact and approximate coalescent simulation (including selection) in a single unified framework. Unlike other exact simulators, cosi2 avoids constructing the full Ancestral Recombination Graph (ARG); instead, it tracks only the much smaller frontier of the ARG. Unlike existing approximate simulators, cosi2 implements the Markov approximation not by moving along the chromosome but by performing a standard backwards-in-time coalescent simulation while restricting coalescence to nodes with overlapping or near-overlapping genetic material. The restriction is efficiently implemented by representing the set of coalesceable node pairs implicitly in a dynamic data structure that supports querying the size of the set and uniformly sampling a node pair to coalesce. The data structure is easy to integrate into any exact coalescent simulator, preserving all existing machinery (including simulation of selection) while adding support for the Markov approximation.

cosi2 allows simulating a wide range of demographic scenarios (selection, varying genetic map, gene conversion, population structure, migration, and population size changes) under both exact and approximate coalescent. It compares favorably with existing simulators on performance, while preserving the properties of output distributions.
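The dynamic data structure described above — a set supporting a size query and uniform random sampling — can be sketched with the standard array-plus-hash-map idiom, which gives O(1) insert, delete, length and sampling. This is an illustration of the general technique, not cosi2's code (which stores coalesceable node pairs).

```python
import random

class SampleableSet:
    """Set with O(1) add, discard, len() and uniform random sampling:
    items live in a list; a dict maps each item to its list index.
    Deletion swaps the last item into the vacated slot."""
    def __init__(self):
        self._items, self._index = [], {}

    def add(self, item):
        if item not in self._index:
            self._index[item] = len(self._items)
            self._items.append(item)

    def discard(self, item):
        i = self._index.pop(item, None)
        if i is not None:
            last = self._items.pop()
            if i < len(self._items):   # removed item was not the last one
                self._items[i] = last
                self._index[last] = i

    def __len__(self):
        return len(self._items)

    def sample(self, rng=random):
        """Uniform random element, O(1)."""
        return rng.choice(self._items)
```

As coalesceable pairs appear and disappear during the backwards-in-time simulation, the structure stays current while sampling remains constant-time.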

B34 - BioGPS: a gene annotation portal customized and contributed by users
  • Chunlei Wu, The Scripps Research Institute, United States

Short Abstract: Genome-scale studies typically result in a list of candidate genes, most of which aren't immediately familiar to the researcher who conducted the study. Publicly available web-based gene annotation resources help researchers prioritize their candidate genes for follow-up studies. However, there are hundreds (or more) of different resources available, and it's simply impractical to manually access and review all of them for each candidate gene. Moreover, keeping abreast of new gene annotation resources is a continuing challenge.

BioGPS is a gene annotation portal that catalogs a growing list of resources and allows users to group their most relevant ones into a customized gene report page (called a gene report "layout"). Users can save these custom layouts, enabling easy access to their favorite resources. Moreover, any user can contribute back to BioGPS by submitting a new resource as a "plugin". The new plugin is immediately available for that user ("private" plugin) or all users ("public" plugin) to add to their custom gene report layouts. With this customizable and extensible framework, the selection of resources available to BioGPS users grows over time, without any intervention from the developers. To date, users have contributed over 600 plugins and customized nearly 3,000 layouts, with almost 2 million page views each year.

B35 - An extended bioinformatics roll for the Rocks Cluster
  • Glen Newton, Agriculture and Agri-Food Canada, Canada

Short Abstract: Bioinformatics supports a broad area of research: a bioinformatics cluster supporting a research organization must often include many and diverse software packages to serve the research goals of its client scientists. At AAFC, a Rocks cluster has been deployed which supports the research efforts of some ECORC researchers, including phylogenetics, metagenomics and genomics research on fungi, bacteria, plants and insects. In order to support these efforts, additional Open Source software from the bioinformatics community has been packaged (as RPMs) and added to the base Rocks system. These include software supporting alignment (abacas, bowtie, cufflinks, exonerate), assembly (ABySS, amos, BreakDancerMax, GapFiller, idba (transcriptome), IMAP, jContigSort, maq), genome browsing/visualization (artemis, circos, gff2ps, IGV), phylogenetics (beagle, FastTree, FigTree, Gblocks, jModelTest, KING), and others. Over 148 bioinformatics and related packages have been built, tested and bundled into a new Rocks roll and a Yum repository, both available at GitHub. It is our hope that this roll will form an important building block in the bioinformatics community, reducing the overhead of managing and deploying bioinformatics software on a Rocks cluster.

B36 - Genome Annotation using Nanopublications: An Approach to Interoperability of Genetic Data
  • Rajaram Kaliyaperumal, Leiden University Medical Center, Netherlands

Short Abstract: With the widespread use of Next Generation Sequencing (NGS) technologies, the primary bottleneck of genetic research has shifted from data production to data analysis. However, annotated datasets produced by different research groups are often in different formats, making genetic comparisons and integration with other datasets challenging and time-consuming tasks. Here, we propose a new data interoperability approach that provides unambiguous (machine-readable) descriptions of genomic annotations based on a novel method of data publishing called nanopublication. A nanopublication is a schema built on top of existing semantic web technologies that consists of three components: an individual assertion (i.e., the genomic annotation); provenance (containing links to the experimental information and data processing steps); and publication info (information about data ownership and rights, allowing each genomic annotation to be citable and its scientific impact tracked). We use nanopublications to demonstrate automatic interoperability between individual genomic annotations from the FANTOM5 consortium (transcription start sites) and the Leiden Open Variation Database (genetic variants). The nanopublications can also be integrated with data from other semantic web frameworks such as COEUS. Exposing legacy information and new NGS data as nanopublications promises tremendous scaling advantages when integrating very large and heterogeneous genetic datasets.

B37 - A Survey of Tools & Platforms for Computational Biology Research
  • Mehedi M Hassan, University of South Wales,

Short Abstract: In 1995, when the first complete microbial genome was published, only a handful of applications supported the browsing, study and analysis of sequences. Over the last two decades, a plethora of applications, tools, libraries and platforms has emerged and contributed to computational biology research, and recent years have seen a further rise in such application development. Some authors have contributed area-specific lists and classifications; however, a comprehensive listing and comparison is not available.

We have reviewed the literature covering such listings of tools, libraries and platforms. Using publication directories and search engines such as Google Scholar, Microsoft Academic Research, ResearchGate and PubMed Central, together with article citations, we surveyed the use of language-specific libraries (e.g. BioPerl, SeqAn). We have also reviewed a range of application suites, workflows and platforms developed by publicly funded projects (e.g. EMBOSS, Bio-Linux, eUtils).

We present our findings on the classification of tools and platforms and their citation metrics.

B38 - Jalview and JABA: Comparative Visual Analysis of Protein and RNA Sequence and Structure
  • James Procter, University of Dundee,

Short Abstract: Jalview is a stand-alone and web-based system for interactive visual analysis of multiple sequence alignments, trees, 3D structures and annotation. It is widely used in teaching and research, and the stand-alone application is launched over 270,000 times worldwide each year. JABA web servers provide a range of alignment, conservation analysis, and protein and RNA secondary structure prediction services, including JPred and VIENNA, that can be accessed via the command line and through the Jalview Desktop. We describe a range of new developments, including support for easy comparison and analysis of predicted and observed secondary structure across multiple sequence alignments.

Jalview and JABA are open source projects coordinated by the University of Dundee with funds from the UK's Biotechnology and Biological Sciences Research Council (BBSRC) and the Wellcome Trust. Several major developments are planned for the next 5 years, including the provision of state-of-the-art homologous sequence search, phylogenetic inference, and integration with ENSEMBL. Our funding also allows us to maintain an active outreach program. This includes training courses for users of all abilities, hackathons, and the provision of up-to-date training materials and sample data for use in teaching. For details of current and future outreach events, please see the Jalview website.

B39 - Merging Statistics and Biology: Downstream Analysis Assistant (DAA) Pipeline
  • Oleg Moskvin, University of Wisconsin-Madison, United States

Short Abstract: RNA-Seq technology has brought the potential informativeness of transcriptomic experiments to an entirely new level of accuracy and resolution. However, it remains underutilized, because both statistical methods for RNA-Seq data analysis and algorithms for transcriptome assembly are still evolving actively, and a significant gap between this methodological evolution and domain expert-driven biological analysis still exists. We believe that successful development of algorithms and pipelines for extracting biologically valuable information from RNA-Seq experiments requires robust feedback loops between biology and statistics. To enable a dialogue between the selection and fine-tuning of analytical methods, on one hand, and supervised evaluation of higher-level results, on the other — such as responsive metabolic pathways, GO categories, regulons, metabolite-centric enzyme sets, other externally defined gene sets (e.g. biclusters generated in a meta-analysis of public datasets), or gene subnetworks — we have built an expandable pipeline that takes raw reads and associated experimental conditions as input and produces a report on functional and regulatory patterns of response at the gene-set level. An essential feature of the system is the ability to scan through data processing parameters at every stage of the analysis and to associate the patterns of biological significance with the combinations of data processing parameters used at every level, from read alignment options to the particulars of differential expression testing and unsupervised clustering. Supervised evaluation of the reported biological response patterns in datasets representing better-known biology is used to optimize the processing parameters, helping to discover new biology in new datasets.

B40 - RNA-QC-Chain: Comprehensive and fast quality control for RNA-Seq data
  • Kang Ning, Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, China

Short Abstract: RNA-Seq has become one of the most widely used applications based on next-generation sequencing. Quality control (QC) is the critical first step in obtaining reliable results from downstream RNA-Seq analysis. Here we report RNA-QC-Chain, a parallel and complete QC solution specifically designed for RNA-Seq data. RNA-QC-Chain accomplishes data QC in three modules: (1) read-quality assessment and trimming; (2) detection and filtration of rRNA reads and possible contamination; (3) alignment quality assessment (including read number, alignment coverage, sequencing depth, alignment region and paired-end read mapping statistics). The processing speed of RNA-QC-Chain is very fast, since most of the QC procedures are optimized for parallel computation.
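The first module's read-quality trimming can be sketched as follows: drop low-quality bases from both ends of each read. A minimal illustration of end-trimming by a Phred-score threshold (not the RNA-QC-Chain algorithm; the threshold and trimming rule are illustrative assumptions):

```python
def trim_read(seq, quals, threshold=20):
    """Trim low-quality bases from both ends of a read: drop leading
    and trailing positions whose Phred score is below `threshold`,
    returning the trimmed sequence and its quality scores."""
    start = 0
    while start < len(quals) and quals[start] < threshold:
        start += 1
    end = len(quals)
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return seq[start:end], quals[start:end]
```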

B41 - Data mining in massive number of microbial communities based on similarity network
  • Kang Ning, Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, China

Short Abstract: Most microbes do not live independently in nature; they live and reproduce together as "microbial communities". NGS (Next Generation Sequencing) data from microbial community samples can be analyzed to reveal species diversity, abundance and phylogenetic information. With the development of sequencing technology, the amount of microbial community data has grown to terabytes in size, and these massive datasets contain a large amount of valuable biological information. How to decrypt this information based on the relationships between microbial communities has therefore become a new direction in the field.
This work aims to find correlations between microbial communities and environmental factors by data mining methods based on a similarity network and environmental information. The similarity between microbial communities is generated by phylogenetic quantitative similarity computation on the weighted binary phylogenetic tree for metagenomic data, which can quantitatively measure the overall similarities among microbial communities. In the similarity network, through environmental-difference and clustering analysis, as well as evaluation of the correlation between the clustering results and the environmental factors, we can infer the diversity among microbial communities induced by environmental factors, and then identify the biomarker taxa that significantly affect the microbial communities' structure.
Microbial community analysis based on the similarity network will serve further data mining and in-depth understanding of the underlying principles controlling the functions and evolution of various microbial communities, and has great potential in applications.

B42 - Improving RNA Nearest Neighbor Parameters for Predicting Helical Stability by Going Beyond the Two-State Model
  • Aleksandar Spasic, University of Rochester Medical Center, United States

Short Abstract: RNA folding free energy change nearest neighbor parameters are widely used to predict the folding stability of secondary structures. They were determined by linear regression against stability data sets from optical melting experiments on small model systems. Currently, optical melting experiments are analyzed assuming a two-state model, i.e. the structure can be either complete or denatured. Experimental evidence, however, suggests that structures exist in an ensemble of conformations, some of which can be partially unfolded. Partition functions using nearest neighbor parameters, which are used to predict secondary structure of nucleic acids, also predict that structures can be partially denatured. These findings are in direct conflict with the assumption of the two-state model. In this work, a new approach for determining RNA nearest neighbor parameters is presented that does not use a two-state assumption. The available optical melting data for Watson-Crick helices were fit directly to a partition function model that allows an ensemble of conformations for each structure. The fit minimized the difference between the fraction of double-stranded regions obtained from melting experiments and that calculated from partition functions, using a non-linear least-squares method. The fitting parameters were the enthalpy and entropy parameters for helix initiation, terminal AU pairs, stacks of Watson-Crick pairs and disordered internal loops. The resulting set of nearest neighbor parameters shows a 41% improvement in describing the experimental melting curves compared to the original set, and gives improved estimates for additional helices not used in the fitting.
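For context, the two-state model this work moves beyond predicts the whole melting curve from ΔH° and ΔS° alone. A minimal sketch for a unimolecular transition (illustrative parameter values; the poster actually fits bimolecular helices with full partition functions, not this simplified form):

```python
import math

R = 1.987e-3  # gas constant, kcal/(mol*K)

def fraction_folded(dH, dS, T):
    """Two-state fraction folded at temperature T (Kelvin) for a
    unimolecular transition with folding dH (kcal/mol) and
    dS (kcal/(mol*K)): K = exp(-dG/RT), f = K / (1 + K)."""
    K = math.exp(-(dH - T * dS) / (R * T))
    return K / (1.0 + K)

def melting_temperature(dH, dS):
    """Tm is where dG = 0, i.e. exactly half the molecules are folded."""
    return dH / dS
```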

B43 - A scale-free structure prior for Bayesian inference of Gaussian Graphical models
  • Osamu Maruyama, Kyushu University, Japan

Short Abstract: We address the problem of estimating a scale-free inverse covariance matrix of a Gaussian distribution from its samples. In the graph derived from the inverse covariance matrix of a Gaussian distribution, it is known that there is no edge between the nodes corresponding to random variables xi and xj if and only if xi and xj are conditionally independent given the remaining variables. The design of a prior distribution that measures how likely such a graph is to be scale-free is critical to achieving good predictability, especially when the number of samples available from the distribution is limited. For this estimation problem, we propose a novel scale-free structure prior and devise a sampling method for optimizing the posterior probability that includes the prior. In a simulation study, scale-free graphs of 30 and 100 nodes were generated by the Barabasi-Albert model, and the proposed method is shown to outperform others on those data. In a real-data experiment, our method is applied to gene expression profiles, and biologically meaningful features are found in the estimated graphs.
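As an illustration of what a scale-free structure prior rewards, the toy sketch below (an assumed functional form, not the authors' prior) scores a graph by an unnormalized log-prior that favors power-law-like degree sequences: a hub-and-spoke graph scores higher than a degree-homogeneous chain with the same number of edges.

```python
import math
from collections import Counter

def scale_free_log_prior(edges, n_nodes, gamma=2.5):
    """Unnormalized log-prior over graph structures that rewards
    power-law-like degree sequences: log P(G) ~ -gamma * sum_i log(d_i + 1).
    Concentrating degree on a few hubs costs less than spreading it."""
    deg = Counter()
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return -gamma * sum(math.log(deg[i] + 1) for i in range(n_nodes))

# Two 10-node graphs with the same number of edges (9):
star = [(0, i) for i in range(1, 10)]   # one hub, scale-free-like
chain = [(i, i + 1) for i in range(9)]  # homogeneous degrees
```

In the actual method, a prior of this flavor is combined with the Gaussian likelihood and the resulting posterior is optimized by sampling.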

B44 - Molecular indexing enables targeted RNA-qSeq and reveals poor efficiencies in standard library preparations
  • Weihong Xu, Stanford University, United States

Short Abstract: A simple molecular indexing method was developed for quantitative targeted RNA sequencing (RNA-qSeq), in which mRNAs of interest are selectively captured from complex cDNA libraries and sequenced to determine their absolute abundance. cDNA fragments are individually labeled by molecular indexing prior to PCR amplification, so that each cDNA molecule can be traced throughout the whole library preparation and sequencing process. Clones created by PCR amplification can now be identified and assigned to their distinct parent molecules. We have also constructed a set of synthetic RNA molecules with embedded molecular indices, which can be used as spike-in controls to monitor the library construction process. With the molecular indexing method, we found low efficiency in standard library preparations, which was further confirmed by the synthetic spike-in RNA molecules. This finding shows that standard library preparation methods result in the loss of rare transcripts, leading to significant representation bias, and highlights the need for more efficient sample preparation methods and improved computational methods to better quantify rare transcripts.

B45 - Reformation of Galaxy for local research
  • Yuan Hao, Cold Spring Harbor Laboratory, United States

Short Abstract: Galaxy is an open source platform for the computational study of biomedical data, in particular the intensive data generated by next-generation techniques, and requires no prior knowledge of computer programming. Galaxy has expanded greatly in scale in the past few years to accommodate newly emerging techniques, tools, and user requirements, making it one of the most comprehensive collections of tools for bioinformatics analysis.

Galaxy is currently organized in a task-oriented way that is specific and straightforward; however, this layout does not systematically account for the biological background of the data, the experimental design, or the purpose of the study.

In this work, we have reformatted Galaxy into a new layout with improved specificity by emphasizing and extending the tools in demand for research carried out at the institute, while retaining the most essential tools already available in Galaxy. The reformatted Galaxy is organized in a project-oriented way, with tools grouped into major sections based on experimental design and data type. Each section forms a natural workflow, from initial data processing to result presentation, and is accompanied by a detailed tutorial covering several standard workflows tailored to the section and the specific type of study. The reformatted Galaxy provides a more user-friendly and analysis-efficient interface for the computational study of biomedical data.

B46 - The “Man-Computer Symbiosis” at 54: A review of design patterns and open problems in biomedical knowledge bases
  • Ivo Georgiev, University of Colorado School of Medicine, United States

Short Abstract: We review the state of the art of knowledge base design in the biomedical domain and attempt to extract a set of design patterns that would ensure usefulness, persistence, and ease of maintenance. In 1960, J. C. R. Licklider, one of the most important figures in computer science history, wrote his vision paper “Man-Computer Symbiosis”, in which he imagined a strong collaboration between humans, capable of complex planning and reasoning in continuously varying contexts, and computers, capable of performing enormous computations far faster than humans but completely lacking creativity. We pick about half a dozen representative knowledge base projects from the biomedical domain to illustrate what makes a knowledge base, tease out when the right time is for a particular (sub-)domain to have a knowledge base developed for it, and attempt to systematize a set of design patterns for knowledge bases. Our evaluation criteria are based on Licklider’s vision. Is the human expert fully enabled? Has every piece of repetitive and “dull” work been offloaded to the computer? Has the interface friction been reduced to a minimum? Can the symbiotic relationship tackle new data, exponentially growing data, and new contexts? To ground and test our findings in a concrete domain, as well as to explore the open problems, we describe the prototype design for a knowledge base for spinal cord injury and regeneration.

B47 - Condition Specific Promoter Identification in E. coli using Heterogeneous High-Throughput Sequencing Data
  • Irene Ong, University of Wisconsin-Madison, United States

Short Abstract: Multiomic analyses of improved Escherichia coli ethanologen strains have revealed major barriers to rapid and efficient conversion of cellulosic biomass to ethanol. A key to successful engineering of improved biofuel production by microbes is tight control over the timing and strength of expression of genes involved in sugar uptake, stress reduction, and metabolic conversion functions. Thus, our goal is to identify condition-specific promoters of genes that are active during the conversion of sugar to ethanol by an E. coli ethanologen grown in lignocellulosic media derived from ammonia fiber expansion (AFEX) treated corn stover hydrolysate. To achieve this goal, we have developed an algorithm to identify promoters (assigned by their transcription start sites) that change activity under specific conditions, including the different phases of hydrolysate fermentation. Our algorithm combines evidence from genome-wide transcription start site profiling, ChIP-seq data for various sigma factors (σD, σS, σE, σN, and σH), and strand-specific RNA-seq data using a probabilistic model. We describe our approach in detail and present our findings on data for wild-type E. coli under +/- O2 conditions and for our improved E. coli ethanologen strain under +/- lignotoxin conditions.
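One simple way to combine heterogeneous evidence sources, shown here purely as an illustration and not as the authors' model, is a naive-Bayes-style sum of per-source log-likelihood ratios; the numeric values below are hypothetical.

```python
import math

def combine_evidence(log_likelihood_ratios, prior_odds=1.0):
    """Naive-Bayes-style evidence combination: assuming the data sources
    are conditionally independent given promoter activity, posterior odds
    = prior odds * product of per-source likelihood ratios (a sum in log
    space). Returns the posterior probability of activity."""
    log_posterior_odds = math.log(prior_odds) + sum(log_likelihood_ratios)
    odds = math.exp(log_posterior_odds)
    return odds / (1.0 + odds)

# Hypothetical log-likelihood ratios for one candidate promoter, one per
# source: TSS profiling, sigma-factor ChIP-seq, strand-specific RNA-seq
p_active = combine_evidence([1.2, 0.8, 2.0])
```

The actual model is richer than this sketch (it assigns promoters to transcription start sites and handles condition-specific activity), but the principle of pooling independent evidence streams is the same.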

B48 - Glycomic Elucidation and Annotation Tool (GELATO): A Free Tandem MS Annotator For Glycomics
  • Khalifeh AlJadda, University of Georgia, United States

Short Abstract: Several algorithms have been developed in attempts to automate glycan identification by interpreting tandem MS spectra. However, each of these programs has limitations when annotating MSn data sets of hundreds or thousands of spectra against polluted public databases. Glycomic Elucidation and Annotation Tool (GELATO) is a free, semi-automated tandem MS interpreter designed and implemented at the Complex Carbohydrate Research Center. GELATO provides a novel algorithm to automate the tandem MS interpretation process. The annotation algorithm is implemented as part of the glycomics data processing software “SimianTools”, which provides a user-friendly graphical interface for defining experimental parameters, viewing and exploring results, and refining annotations.

B49 - Standard-Free Bayesian Integration Improves the Predictive Power of Genomic Datasets
  • Marcus Badgeley, Mount Sinai School of Medicine, United States

Short Abstract: Modern molecular technologies allow the collection of large amounts of high throughput data on the functional attributes of genes. Often multiple technologies and study designs are used to address the same biological question, such as which genes are overexpressed in a specific disease state. Consequently, there is considerable interest in methods that can integrate across datasets to present a unified set of predictions. 
An important aspect of data integration is accounting for the fact that datasets may differ in how accurately they capture the biological signal of interest. While many methods address this problem, they rely either on dataset-internal statistics, which reflect data structure and not necessarily biological relevance, or on external gold standards, which may not always be available. We present a new rank aggregation method for data integration that requires neither external standards nor internal statistics but relies on Bayesian reasoning to assess dataset relevance. We demonstrate that our method outperforms established techniques and significantly improves the predictive power of rank-based aggregations. We show that our method, which does not require an external gold standard, provides reliable estimates of dataset relevance and allows the same set of data to be integrated differently depending on the specific signal of interest.
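To make rank aggregation concrete, here is a minimal weighted mean-rank (Borda-style) sketch; the abstract's method goes further by inferring the dataset relevance weights themselves via Bayesian reasoning rather than taking them as given.

```python
def weighted_rank_aggregate(rankings, weights):
    """Combine per-dataset gene rankings (best first) into one consensus
    ordering by weighted mean rank; lower combined rank is better."""
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, gene in enumerate(ranking):
            scores[gene] = scores.get(gene, 0.0) + w * rank
    return sorted(scores, key=scores.get)

# Two hypothetical datasets ranking the same three genes; here the first
# dataset is judged twice as relevant as the second, so its ordering wins.
consensus = weighted_rank_aggregate([["A", "B", "C"], ["B", "A", "C"]],
                                    [2.0, 1.0])
```

Note how swapping the weights flips the consensus between the two datasets' orderings, which is why estimating relevance per signal of interest matters.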

B50 - Semi-automated Screening of Compounds Through the Use of Computational Analysis
  • Edwin Solares, University of California, Irvine, United States

Short Abstract: In the search for small-molecule candidates for in vivo testing, in silico methods are often used to reduce the search space, since testing thousands of small molecules in vivo is cost prohibitive. In silico methods rank potential candidates based on docking to a ligand or protein, using tools such as AutoDock Vina, Surflex, and molecular dynamics simulations. Although these programs help reduce the search space, they are not sufficient on their own; additional, mutually independent methods must be used, which increases complexity. To reduce this complexity, we developed a semi-automated pipeline built on a database and scripting within a Linux environment. The database allows the researcher to store and filter results, but the process remains only partially automated and is still complex and technical. A web-based application can convert this pipeline into a more automated system that takes advantage of the database and interfaces with the Linux operating system to assist with script generation. Given a template and parameter input from the end user, scripts can be generated automatically and reused, further reducing complexity. The web-based tool also allows for customized pipelines, integration of other tools and algorithms such as machine learning, and job submission on computational clusters.
