IndeCut: A Cut-norm Based Method for Evaluating Independent and Uniform Sampling in Network Motif Discovery Algorithms

Presenting Author: Mitra Ansariola, Oregon State University

David Koslicki, Oregon State University
Molly Megraw, Oregon State University

Network motif discovery is a well-established statistical strategy for identifying over-represented sub-network structures within a larger network. In the biosciences, it serves as a prominent conceptual tool that enables scientists to recognize biologically important patterns and generate testable hypotheses within large genetic networks of interest. Network motif discovery algorithms function by comparing the frequency of a particular sub-network of interest within a given 'real-world' network to its frequency in a large collection of randomized networks. While the method of randomization may differ, all algorithms face the challenge of sampling uniformly and independently from the set of all possible randomized networks that may be generated. Though several network motif discovery tools with different underlying random sampling strategies are available, scientists who want to apply these tools to their own networks of interest currently have no method by which to assess whether a given tool will provide an accurate outcome. Most users cannot test the correctness of detected motifs in the laboratory due to prohibitive cost, so access to such an evaluation method is essential. In this talk, we present IndeCut, the first method that numerically determines the degree of sampling uniformity and independence of network motif discovery algorithms, and to date the only method that characterizes network motif finding algorithm performance beyond computational efficiency.
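As a conceptual illustration of what "sampling uniformity" means here, the sketch below measures how far a sampler's empirical frequencies deviate from the uniform expectation. This is a toy stand-in, not IndeCut itself, which instead bounds a cut norm over matrices of expected edge occurrences:

```python
from collections import Counter

def uniformity_deviation(samples, outcomes):
    """Largest relative deviation of observed frequencies from the
    uniform expectation; 0.0 indicates perfectly uniform sampling.
    Illustrative only -- IndeCut's actual statistic is cut-norm based."""
    counts = Counter(samples)
    expected = len(samples) / len(outcomes)
    return max(abs(counts.get(o, 0) - expected) for o in outcomes) / expected
```

A biased sampler that over-produces one randomized network would yield a deviation well above zero even as the sample size grows.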

Reproducible Computational Workflows with Continuous Analysis

Presenting Author: Brett Beaulieu-Jones, University of Pennsylvania

Casey Greene, University of Pennsylvania

Reproducing experiments is vital to science. Being able to replicate, validate, and extend previous work also speeds new research projects. Reproducing computational biology experiments, which are scripted, should be straightforward, but doing so remains challenging and time consuming. In an ideal world, we could quickly and easily rewind to the precise computing environment where results were generated, then reproduce the original analysis or perform new analyses. We introduce a process termed "continuous analysis" which provides inherent reproducibility to computational research at minimal cost to the researcher. Continuous analysis combines Docker, a container service similar to virtual machines, with continuous integration, a popular software development technique, to automatically re-run computational analyses whenever relevant changes are made to the source code. This allows results to be reproduced quickly, accurately, and without needing to contact the original authors. Continuous analysis also provides an audit trail for analyses that use data with sharing restrictions, allowing reviewers, editors, and readers to verify reproducibility without manually downloading and rerunning any code.
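A minimal sketch of the triggering logic behind continuous analysis, assuming a hash-based notion of "relevant change"; in practice a continuous integration service performs this detection and re-runs the analysis inside a Docker container:

```python
import hashlib
from pathlib import Path

def source_fingerprint(paths):
    """Hash the contents of the tracked source files; any relevant
    change yields a new fingerprint, signalling that a re-run is needed."""
    digest = hashlib.sha256()
    for p in sorted(paths):
        digest.update(Path(p).read_bytes())
    return digest.hexdigest()

def needs_rerun(paths, last_fingerprint):
    """True when the analysis should be re-executed against new code."""
    return source_fingerprint(paths) != last_fingerprint
```

Real CI services apply the same idea at the granularity of commits rather than file hashes, which is what ties each published result to an exact code version.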

Application Ontologies Supporting Phenotyping from Clinical Text

Presenting Author: Wendy Chapman, University of Utah

Representation of the knowledge described in clinical reports is critical to accurate phenotyping of patients. We have developed two application ontologies for modeling annotations of clinical reports. The schema ontology describes the clinical entities described in reports, such as findings, procedures, and medications. The modifier ontology enumerates the allowable modifiers for those entities, of three types: shared modifiers that apply to all entities (negation, uncertainty, and temporality); semantic modifiers specific to particular entities, such as dose and route for medications; and numeric modifiers for specifying numeric values such as body temperature. A user can create a domain ontology by creating instances of entity-modifier combinations, accommodating rich phenotypic representation for concepts like "no family history of colon cancer" or "severe carotid stenosis in the right internal carotid artery". In addition to modeling the semantic composition of the concepts, the ontologies provide value sets and lexical variants that can be customized and enhanced. Our long-term goal is to create shareable libraries of domain ontologies.

In addition to supporting annotation of concept mentions, SWRL rules stored in the ontology support inferencing over mention annotations for classification at the document, encounter, and phenotype/patient level. The ontologies support rich phenotypic characterizations that go beyond binary phenotypes toward answering questions like “what histologic types of breast cancer are associated with patients that have a substitution mutation on BRCA1?” and “for patients with a papillary breast carcinoma who underwent a neoadjuvant treatment regimen, how many have had a recurrence or metastasis?”

SPARQLer: Making Knowledge Functional

Presenting Author: Daniel McShan, University of Colorado School of Medicine

Knowledge base triplestores are notoriously challenging to navigate. This talk will cover some approaches for making the stored knowledge more functional. While SPARQL 1.1 is extraordinarily powerful, it can be difficult for someone unfamiliar with the underlying graph structure to quickly construct meaningful queries.

SPARQLer is a thin syntactic-sugar layer in Clojure that allows the construction of decomposable functional queries that are modular and reusable. These functional components can then be tested individually against validated examples. SPARQLer is demonstrated as a functional front end to the Hunter lab's KABOB Knowledge Base of Biology, and various examples will illustrate the modularity and reuse potential of the approach.
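SPARQLer itself is written in Clojure; the following Python sketch conveys the same idea of decomposable, individually testable query fragments (the `ex:interactsWith` predicate and fragment names are hypothetical, for illustration only):

```python
def triple(s, p, o):
    """One SPARQL basic graph pattern."""
    return f"{s} {p} {o} ."

def protein_label(var="?protein"):
    # Reusable fragment: bind the rdfs:label of a protein node.
    return triple(var, "rdfs:label", "?label")

def interacts_with(a="?protein", b="?partner"):
    # Hypothetical predicate, standing in for a real KABOB relation.
    return triple(a, "ex:interactsWith", b)

def select(*fragments, variables="*"):
    """Compose independently testable fragments into one query string."""
    body = "\n  ".join(fragments)
    return f"SELECT {variables} WHERE {{\n  {body}\n}}"
```

Because each fragment is an ordinary function, it can be unit-tested on its own and reused across many queries, which is the core of the approach.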

Improved Network Ontology Analysis by Segmentation

Presenting Author: Ananda Mondal, Claflin University

Charles Schultz, University of Utah
Markea Sheppard, Claflin University
Jasmine Carson, North Carolina A&T State University

Department of Mathematics and Computer Science, Claflin University, Orangeburg, SC, USA
*Corresponding Author: amondal@claflin.edu


Our recent study of filtering disease proteins from a protein network biomarker for liver cancer, based on subcellular protein locations, resulted in groups of proteins at different locations with two distinct network structures, namely clique and bipartite graph. This motivated us to develop a Segmentation Algorithm that can be used as a preprocessing tool to provide better gene-term enrichment analysis based on network ontology.
The proposed algorithm breaks the source network into component subgraphs, such as bipartite-like or clique-like subgraphs, using an appropriate metric. The component subgraphs are then analyzed independently using the Network Ontology Analysis (NOA) method, and the independent results, which contain overlapping ontological components, are integrated to form a single representation for gene-ontology analysis.
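A sketch of the segmentation idea, assuming a simple density-threshold heuristic for "clique-like" (the 0.8 cutoff and the classification rule are illustrative assumptions, not the published metric):

```python
def connected_components(adj):
    """Split a graph (dict: node -> set of neighbors) into components."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def classify(adj, comp, clique_density=0.8):
    """Crude density heuristic: near-complete subgraphs are clique-like."""
    n = len(comp)
    if n < 2:
        return "singleton"
    edges = sum(len(adj[v] & comp) for v in comp) // 2
    density = edges / (n * (n - 1) / 2)
    return "clique-like" if density >= clique_density else "bipartite-like/sparse"
```

Each labeled component would then be fed separately into NOA, and the per-component results merged afterward.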
We applied the developed technique as a preprocessing tool for NOA analysis of protein network biomarkers for adherens junction and breast cancer. Results showed that the proposed algorithm produces a more concise and easily interpretable representation of the gene-term relationship compared to the representation produced using NOA alone.

Keywords: Gene-Term Enrichment, Gene Ontology Analysis, Network Ontology Analysis, NOA, Segmentation, Subgraph.

An Image Phenotyping Environment Based on Open-Source Tools

Presenting Author: Brian Chapman, University of Utah

John Roberts, University of Utah

Medical imaging data are an often-overlooked resource for defining patient phenotypes. Because image data are unstructured, extracting information from them requires pipelines for identifying relevant studies, segmenting and quantifying features from the images, and linking these features to other data sources (e.g. the EHR). We are building an image phenotyping environment based on open-source tools and deployed using Docker (https://www.docker.com/), allowing us to version-control our environments, which are defined with simple text files. Our phenotyping pipeline is built using three open-source projects: 1) Orthanc (http://www.orthanc-server.com/), a light-weight DICOM server for communicating with the clinical PACS and scrubbing images for research purposes; Orthanc allows for persistent, customized scrubbing processes. 2) Girder (https://girder.readthedocs.io/en/latest/), an open-source, web-based data management system developed by Kitware, Inc.; Girder provides user authentication, access control, and a framework for linking data and defining metadata, and we have integrated it with BioPortal so that data uploads are tagged with concepts from relevant ontologies. 3) JupyterHub, which provides web-based computational environments as Docker containers serving Jupyter notebooks. Jupyter notebooks allow programming through the web browser in Python and a number of other languages, and ours contain image processing pipelines for extracting features from medical images using SimpleITK and other software packages. Our initial use cases are drawn from dermatology and radiology and require both 2D and 3D feature extraction tasks.

InterViewer, a new Cytoscape-based viewer that displays interactions between selected sets of proteins

Presenting Author: Marek Tutaj, Medical College of Wisconsin

Jyothi Thota, Medical College of Wisconsin
Jeff De Pons, Medical College of Wisconsin
Jennifer Smith, Medical College of Wisconsin
Thomas G Hayman, Medical College of Wisconsin
Victoria Petri, Medical College of Wisconsin
Stan Laulederkind, Medical College of Wisconsin
Shur-Jen Wang, Medical College of Wisconsin
Mary Shimoyama, Medical College of Wisconsin

InterViewer, RGD’s new Cytoscape-based protein-protein interaction viewer (https://rgd.mcw.edu/rgdweb/cytoscape/query.html), facilitates detailed visualization of interactions between proteins. As elsewhere in RGD, interaction data are provided not only for rat but also for mouse and human. The tool accepts input in multiple ways: as a list of UniProt IDs, RGD IDs, or gene symbols. On the display page, binary interaction data from IMEx are displayed as nodes and edges, which can be zoomed in or out using controls. Clicking on a protein node provides links to UniProtKB and to RGD gene report pages, with detailed information about the protein appearing in the upper right; clicking on an edge shows additional information about that interaction. For more complex networks, multiple display filters can be applied: the user can pick a set of interaction types of interest, one or more species, or common interactors. In addition, several layout modes common for Cytoscape graphs, such as ‘cose’ or ‘circle’, are available. A legend details the color-coded interaction types and protein species. The table beneath the display lists downloadable characteristics of each pair of interactors, the complete node list, and node/edge statistics. A bird’s-eye view panel facilitates movement of the display. The tool also has options to generate printable reports and graph images for user convenience.

CAMSA: a Tool for Comparative Analysis and Merging of Scaffold Assemblies

Presenting Author: Max Alekseyev, George Washington University

Sergey Aganezov, George Washington University

Motivation: Despite the recent progress in genome sequencing and assembly, many of the currently available assembled genomes come in a draft form. Such draft genomes consist of a large number of genomic fragments (scaffolds), whose positions and orientations along the genome are unknown. While there exist a number of methods for reconstructing the genome from its scaffolds, utilizing various computational and wet-lab techniques, they often produce only partial, error-prone scaffold assemblies. It therefore becomes important to compare and merge scaffold assemblies produced by different methods, thus combining their advantages and highlighting conflicts for further investigation. These tasks may be labor intensive if performed manually.

Results: We present CAMSA, a tool for comparative analysis and merging of two or more given scaffold assemblies. The tool (i) creates an extensive report with several comparative quality metrics; (ii) constructs a maximally consistent combined scaffold assembly; and (iii) provides an interactive framework for visual comparative analysis of the given assemblies.

Availability: CAMSA is available for download at http://cblab.org/camsa/.
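One simple way to compare two assemblies, in the spirit of CAMSA's comparative metrics, is to compare the scaffold adjacencies each one implies. The sketch below ignores scaffold orientation, which CAMSA itself does track:

```python
def adjacencies(scaffold_order):
    """Unordered adjacent scaffold pairs implied by one assembly."""
    return {frozenset(p) for p in zip(scaffold_order, scaffold_order[1:])}

def compare(asm1, asm2):
    """Partition adjacencies into shared and assembly-specific sets."""
    a1, a2 = adjacencies(asm1), adjacencies(asm2)
    return {"shared": a1 & a2, "only_1": a1 - a2, "only_2": a2 - a1}
```

Assembly-specific adjacencies are exactly the conflicts that would be flagged for further investigation.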

Analysis of Tobacco Users Admitted to Intensive Care Units

Presenting Author: Andrey Soares, University of Colorado School of Medicine

Sonia Leach, National Jewish Health, University of Colorado School of Medicine
Kevin Cohen, University of Colorado School of Medicine
Joan Davis, Southern Illinois University

Smoking is known to cause numerous tobacco-related diseases, such as cancer, heart disease, diabetes, and respiratory disease, as well as death. The Centers for Disease Control and Prevention warns that over 16 million Americans have some disease caused by smoking, which causes about 480,000 deaths in the United States each year. Thus, it is critical for healthcare professionals to identify and treat every tobacco user seen in any healthcare facility. This research seeks to examine whether patients who are current tobacco users have been correctly identified as smokers, whether their smoking status and behaviors have been documented, and whether they have received appropriate treatment recommendations (prescriptions) based on their health conditions. In particular, we will perform text analysis of the chart notes recorded during the patient stay to collect information that can be used to offer tailored treatment recommendations, such as the number of cigarettes used per day, and to identify inconsistencies in how smoking is documented. Preliminary data analysis shows that some tobacco users have not been diagnosed as smokers using the appropriate ICD-9 code, leaving the information about smoking to be retrieved from the text notes or inferred from the prescribed tobacco medications. This research will also evaluate treatment recommendations based on patient health conditions and risks, and will cluster smokers to identify emerging patterns and relationships among characteristics and diagnoses that can support tobacco intervention strategies for patients admitted to intensive care units. We will focus on comorbidities, as tobacco use can trigger new diseases or complicate existing ones.
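A toy example of the kind of chart-note text analysis described, extracting a cigarettes-per-day count. The patterns are hypothetical and ignore negation and context, which a real clinical NLP pipeline must handle:

```python
import re

# Illustrative pattern only; production systems need a richer lexicon.
CPD = re.compile(r"(\d+)\s*(?:cigarettes?|cigs?)\s*(?:per|a|/)\s*day", re.I)

def cigarettes_per_day(note):
    """Return the stated cigarettes-per-day count, or None if absent."""
    m = CPD.search(note)
    return int(m.group(1)) if m else None
```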

A new molecular signature approach for prediction of driver cancer pathways from transcriptional data

Presenting Author: Boris Reva, Icahn School of Medicine at Mount Sinai

Noam Beckmann, Icahn School of Medicine at Mount Sinai
Hui Li, Icahn School of Medicine at Mount Sinai
Andrew Uzilov, Icahn School of Medicine at Mount Sinai
Dmitry Rykunov, Icahn School of Medicine at Mount Sinai
Eric Schadt, Icahn School of Medicine at Mount Sinai

Assigning cancer patients to the most effective treatments requires an understanding of the molecular basis of their disease. While DNA-based molecular profiling approaches have flourished over the past several years to transform our understanding of driver pathways across a broad range of tumors, a systematic characterization of key driver pathways based on RNA data has not been undertaken. Here we introduce a new approach to predict the status of driver cancer pathways based on signature functions constructed as weighted sums of gene expression levels derived from RNA sequencing data. To identify the driver cancer pathways of interest, we mined DNA variant data from TCGA and nominated driver alterations in seven major cancer pathways in breast, ovarian, and colon cancer tumors. The activation status of these driver pathways was then characterized using RNA sequencing data by constructing signature functions in training datasets and testing the accuracy of the signatures in test datasets. The signature functions differentiated tumors with nominated active pathways from tumors with no genomic signs of activation very well (average AUC of 0.8), and they systematically exceeded the accuracies obtained by ten other known classification methods employed as controls. A typical pathway signature is composed of ~20 biomarker genes that are unique to a given pathway and cancer type. Our results confirm that driver genomic alterations are distinctively displayed at the transcriptional level and that transcriptional signatures can generally provide an alternative to DNA sequencing methods in detecting specific driver pathways.
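The core of the approach, a weighted-sum signature score evaluated by AUC, can be sketched as follows (gene names and weights are placeholders, not the published signatures):

```python
def signature_score(expression, weights):
    """Weighted sum of expression levels over the signature genes."""
    return sum(w * expression.get(gene, 0.0) for gene, w in weights.items())

def auc(pos_scores, neg_scores):
    """Rank-based AUC: probability that a tumor with an active pathway
    outscores one without it (ties count as 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```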

Computational analysis of breakome reveals replication fork movement and elucidates mechanisms of DNA double-stranded break formation

Presenting Author: Maga Rowicka, University of Texas Medical Branch at Galveston

Yingjie Zhu, University of Texas Medical Branch at Galveston
Norbert Dojer, University of Texas Medical Branch at Galveston
Anna Biernacka, University of Warsaw
Jules Nde, University of Texas Medical Branch at Galveston
Bernard Fongang, University of Texas Medical Branch at Galveston
Razieyeh Yousefi, University of Texas Medical Branch at Galveston
Abhishek Mitra, University of Texas Medical Branch at Galveston
Ji Li, University of Texas Medical Branch at Galveston
Magdalena Skrzypczak, University of Warsaw
Andrzej Kudlicki, University of Texas Medical Branch at Galveston
Krzysztof Ginalski, University of Warsaw
Philippe Pasero, French National Center for Scientific Research

DNA double-stranded breaks (DSBs) can result from endogenous processes, such as replication stress, or exogenous ones, like chemotherapeutics. We developed a method, termed BLESS, to label DSBs with single-nucleotide resolution and used it to detect them in yeast and human samples with various levels of replication stress. DSBs induced by replication stress, such as those from replication fork collapse, are asymmetric, which we exploited to infer fork position and reconstruct fork movement. DSBs are rare events; therefore, the signal-to-noise ratio is typically low in DSB sequencing data. To address this, we built a model of the expected BLESS read pattern around an origin and used Fourier-transform-based filtering to improve our ability to detect patterns related to replication stress. The resulting model allowed us to predict 169 early origins in the budding yeast genome, and our predictions were confirmed to have at least 94% accuracy by BrdU staining. Our approach is applicable to other organisms, such as human, although the accuracy of our predictions in humans is unclear due to the lack of high-quality data on origin locations. Our analysis also suggests putative displacement of MCM double-hexamers in the close vicinity of replication origins. Finally, analysis of the breakome obtained from the BLESS method and the alternative Break-Seq technology allowed us to clarify which specific types of DSBs are formed during replication fork collapse, and thus to infer more precisely than previously possible where in the fork vicinity the breaks take place and which among several proposed mechanisms of break creation is most likely occurring in our experiments.
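A minimal sketch of Fourier-transform-based filtering of a read-density profile; the cutoff fraction is an illustrative assumption, not the parameter used in the study:

```python
import numpy as np

def fourier_lowpass(signal, keep_fraction=0.1):
    """Zero out high-frequency components of a read-density profile,
    suppressing noise while keeping the broad periodic pattern."""
    spectrum = np.fft.rfft(signal)
    cutoff = max(1, int(len(spectrum) * keep_fraction))
    spectrum[cutoff:] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))
```

Applied around a candidate origin, this kind of filtering makes the expected asymmetric read pattern easier to match against the model.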

HRC3 – A new class of motifs involved in chromatin organization and development.

Presenting Author: Andrzej Kudlicki, University of Texas Medical Branch

Chromatin modifications, such as methylation and acetylation of lysine residues in histone tails, are an important mechanism of epigenetic regulation. It remains unclear how the enzymes responsible for histone modifications are directed to the correct loci, in a manner that is specific to the cell type and outside stimuli.
We report the discovery of a conserved structural signature of DNA fragments that coincides with experimental binding sites of histone-modifying enzymes such as KDM5B, KDM5A, PHF8, EZH2, RBBP5, SAP30, HDAC1, and HDAC6, as well as SUZ12, CHD1, and SMARCB1, which are involved in regulation of chromatin organization and silencing. The signature (“the HRC3 motif”) is approximately 180 base pairs long and is defined by a specific, periodic pattern in the hydroxyl radical cleavage profile of a dsDNA interval. The pattern is present in both non-coding and coding sequences; in coding sequences it is produced by a very specific choice of codons in the region. The HRC3 signature is associated with several thousand genes; functional analysis shows highly significant enrichment of genes involved in processes related to development (GO:0009888, GO:0048731, GO:00325020), regulation of gene expression, and DNA binding (GO:0003677). The HRC3 motifs are highly conserved, remaining unchanged from human to Drosophila. The most intriguing property of these motifs is their association with pairs or clusters of developmental transcription factors with conserved synteny, including Hox genes. We present a model that uses HRC3s to explain the colinearity of Hox clusters in segmented animals. We also discuss their possible role in the control of replication initiation.

Transforming OWL for Network Inference

Presenting Author: Tiffany Callahan, University of Colorado Denver Anschutz Medical Campus

William A. Baumgartner Jr, University of Colorado Denver Anschutz Medical Campus
Marc Daya, University of Colorado Denver Anschutz Medical Campus
Lawrence E. Hunter, University of Colorado Denver Anschutz Medical Campus

Structural transformation of biological knowledge represented using Semantic Web standards significantly improves the utility of visualization tools and network analytics. Link prediction algorithms are powerful tools for predicting unobserved connections between nodes in a network. The application of such algorithms to biological networks has led to the correct prediction of previously unobserved relationships ranging from protein-protein interactions to novel p53 kinases. The use of such algorithms to analyze larger and more complex representations has the potential to generate novel and important hypotheses and insights into biological mechanisms. Unfortunately, the direct application of these algorithms to biological knowledge is limited by the representational complexity of the Web Ontology Language standard, OWL. The Network Information Content Entity (NICE) approach, a novel transformation method, reversibly transforms OWL-compliant biomedical knowledge into a representation better suited for visualization and network inference algorithms. Using several illustrative biomedical queries, the NICE transformation produces simpler network representations that are more visually comprehensible and whose structural properties (e.g. clustering coefficient, modularity, number of shortest paths, average number of neighbors, and diameter and radius) are significantly improved over raw OWL. Furthermore, comparison of the results from the application of several state-of-the-art link prediction algorithms on raw OWL versus NICE networks shows that the NICE transformation results in more accurate and biologically meaningful predictions. For each query and each algorithm, the top ten predicted links for both OWL and NICE networks were validated via evidence from literature review and domain expert consultation.
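For concreteness, here is a common-neighbors scorer, one of the simplest link prediction algorithms of the kind applied to NICE networks (illustrative; the abstract does not specify which algorithms were benchmarked):

```python
from itertools import combinations

def common_neighbor_scores(adj):
    """Score every unconnected node pair by its number of shared
    neighbors; higher-scoring pairs are the predicted links."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            scores[(u, v)] = len(adj[u] & adj[v])
    return scores
```

On a simplified NICE-style network, the top-scoring pairs would be the candidate links submitted for literature and expert validation.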

ShinyLearner: Enabling biologists to perform robust machine-learning classification

Presenting Author: Stephen Piccolo, Brigham Young University

Terry Lee, Brigham Young University
Shelby Taylor, Brigham Young University

Machine-learning classification is an invaluable tool for biologists. In one type of application, biomedical researchers use classification algorithms to predict whether patients will respond to a particular drug or belong to a specific disease subtype. Although the research community has developed many classification algorithms and corresponding software libraries, considerable barriers exist for non-computational biologists to take advantage of these tools. Different algorithms are written in different programming languages and require different input formats. Software libraries may require dependencies that are difficult to install, and the software may fail if incompatible versions are installed. Researchers who want to employ algorithms implemented in multiple software libraries may need to learn multiple programming languages and take care to avoid biases when comparing across algorithms.

We developed ShinyLearner (https://github.com/srp33/ShinyLearner), an open-source software tool that reduces these barriers. ShinyLearner integrates several popular machine-learning libraries (e.g., scikit-learn, mlr, weka) within a Docker container that includes all software dependencies. Accordingly, ShinyLearner can be installed with ease. ShinyLearner supports Monte Carlo and k-fold cross validation and provides an option for feature selection. When multiple classification algorithms are used, ShinyLearner dynamically selects the best algorithm via nested evaluation. A simple Web interface facilitates the process of selecting parameters. Output files are in "tidy" format to enable easier processing with external tools. New algorithms can be integrated into ShinyLearner with a simple GitHub pull request.
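The k-fold cross validation that ShinyLearner supports can be illustrated with a plain index-splitting helper (a sketch of the general technique, not ShinyLearner's internal code):

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle once, then deal samples into k disjoint test folds;
    each tuple returned is (train_indices, test_indices)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]
```

Each algorithm is trained on the train indices and scored on the held-out test indices, so every sample is used for evaluation exactly once.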

Finally, we will describe findings from a comprehensive benchmark comparison across classification algorithms applied to 20+ gene-expression data sets.

Stratification of prostate cancer patients based on molecular interaction profiles

Presenting Author: Roland Mathis, IBM Research

Matteo Manica, IBM Research
Maria Rodriguez Martinez, IBM Research

Prostate cancer is a leading cause of cancer death amongst men; however, the molecular mechanisms of disease onset and progression remain largely unknown. In particular, stratification of intermediate prostate tumor states based on current markers is difficult. The aim of this project is to integrate multi-omics data from individual patients with knowledge from the literature and public databases to infer a molecular interaction network specific to prostate cancer. Inspired by the DREAM5 challenge, we integrate predictions from multiple inference methods based on information theory, correlation, and regression models to build a disease-specific interactome. Emphasis is put on combining different data types and systematically integrating prior information using natural language processing and knowledge graphs. From the interactome we identify relevant interaction modules through graph-theory approaches. For each interaction module we cluster the patients based on molecular state measurements. The patient-specific cluster assignment vectors serve as personalized interaction signatures and are used to stratify patients.
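The DREAM5-inspired integration of multiple inference methods can be sketched as rank aggregation over each method's edge-confidence scores (a Borda-style illustration; the project's actual combination scheme may differ):

```python
from collections import defaultdict

def aggregate_rankings(method_scores):
    """Borda-style consensus: each method ranks candidate edges by its
    own confidence score; edges with the best (lowest) total rank come
    first in the consensus list."""
    total_rank = defaultdict(int)
    for scores in method_scores:
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, edge in enumerate(ranked):
            total_rank[edge] += rank
    return sorted(total_rank, key=total_rank.get)
```

Aggregating ranks rather than raw scores sidesteps the fact that information-theoretic, correlation, and regression methods produce confidences on incomparable scales.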

Medication Data Mining of Electronic Medical Records Reveals Race-Specific Prescription Patterns

Presenting Author: Benjamin Glicksberg, Icahn School of Medicine at Mount Sinai

Kipp Johnson, Icahn School of Medicine at Mount Sinai
Khader Shameer, Icahn School of Medicine at Mount Sinai
Joel Dudley, Icahn School of Medicine at Mount Sinai

Introduction: Disparities in medication availability, tolerability, and effectiveness exist and affect patient outcomes. We aimed to mine electronic medical records (EMR) and quantify differences in medication counts, prescription-record counts, and drug-class enrichment in a New York metropolitan area population compiled from the Mount Sinai Data Warehouse.

Methods: Self-reported ancestry was abstracted from EMR (n=2.1 million) as Caucasian (EA), African-American (AA), Hispanic/Latino (HL), Asian (A), or Other (O). Medications were normalized with RxNorm and mapped to Anatomical Therapeutic Chemical (ATC) drug-classes using the PharmaFactors software framework.

Results: We found differences in prescription counts and unique medication counts between races (one-way ANOVA, p<5E-16 for both). AA individuals had more prescription instances and unique medications compared to all other racial groups (Tukey HSD, p<1E-16, all comparisons). Conversely, HL individuals had the fewest prescription instances and unique medications compared to all other groups (Tukey HSD, p<1E-16, all comparisons). Polypharmacy (4+ simultaneous drug prescriptions) varied according to race (χ², p<1E-16), with EA having the highest rate (0.58) and AA the lowest (0.43). ATC drug-class enrichment also varied with race: of 473 level-4 ATC classes, we found 125 and 70 enriched for EA and AA, respectively (Fisher's exact, Q<0.05, OR>1). The most enriched classes per group were: EA, joint and muscle pain and bowel disorders (OR=8.73 for both); AA, antiseptics (OR=8.38); HL, thiazolidinediones (OR=1.14); and A, nucleoside/nucleotide reverse transcriptase inhibitors (OR=7.42).
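The enrichment statistic can be illustrated with the odds ratio of a 2x2 ancestry-by-drug-class contingency table (a Fisher's exact p-value, as used in the abstract, would be computed on the same table):

```python
def odds_ratio(in_group_in_class, in_group_not_class,
               out_group_in_class, out_group_not_class):
    """Odds ratio for a 2x2 drug-class-by-ancestry contingency table;
    OR > 1 indicates the class is enriched in the group."""
    num = in_group_in_class * out_group_not_class
    den = in_group_not_class * out_group_in_class
    return num / den if den else float("inf")
```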

Conclusion: We identified various ancestry-specific prescription data patterns. Further investigation of these patterns may help to develop prescription practices and improve therapeutic outcomes by optimizing drug efficacy and lowering side effects.

Comparative analysis of the expression patterns and regulation of histone variant genes reveals a novel epigenetic pathway related to cancer

Presenting Author: Michael Tolstorukov, Massachusetts General Hospital and Harvard Medical School

Jakub Mieczkowski, Massachusetts General Hospital and Harvard Medical School
Sihem Cheloufi, Massachusetts General Hospital and Harvard Medical School
Konrad Hochedlinger, Massachusetts General Hospital and Harvard Medical School

Minor histone variants replace canonical histones in a replication-independent manner, altering chromatin structure and thereby affecting gene expression. This constitutes a distinct mechanism of genome regulation, extending the function of nucleosomes beyond ‘simple’ DNA packaging. In an unusual genomic arrangement, two genes with unique sequences (H3F3A and H3F3B) both encode the same protein, the developmentally essential histone variant H3.3. It has recently been discovered that mutations in each of these genes occur in different cancers, including pediatric brain tumors. To understand the biological role and regulation of each of these genes, we performed an integrative analysis of gene expression, chromatin organization, and DNA mutability. We show that the H3.3 genes have distinct expression patterns across human cell types. This difference is most pronounced between differentiated and stem-like cells, whose expression profile resembles that of some cancers. Further analysis and experimental tests reveal that transcription factors, including Oct4/Sox2 and N-Myc, can differentially regulate these genes. We directly demonstrate that Oct4 and Sox2 upregulate H3f3a but not H3f3b in mouse embryonic stem cells. Notably, an increased H3F3A contribution to the total H3.3 pool, i.e. its ‘transcriptional dosage’, correlates with tumor malignancy in humans. We infer that a similar increase in H3F3A transcriptional dosage in stem-like cells enables mutations in this gene to impact cell fate determination. Collectively, our findings provide new insights into the interplay between gene expression and DNA mutations, and point to potential therapeutic strategies for H3.3-related cancers.

The Cognoma Collaborative creates a webapp to predict cancer mutations from gene expression

Presenting Author: Daniel Himmelstein, University of Pennsylvania

Gregory Way, University of Pennsylvania
Andrew Madonna, DataPhilly
Alan Elkner, DataPhilly
Benjamin Dolly, DataPhilly
Yichuan Liu, DataPhilly
Claire McLeod, DataPhilly
Derek Goss, DataPhilly
Haitao Cai, DataPhilly
Chris Fuller, DataPhilly
Robert Paul Miller, DataPhilly
Nabeel Sarwar, DataPhilly
Branka Jokanovic, DataPhilly
Mans Singh, DataPhilly
Stephen Shank, DataPhilly
Joel Eden, DataPhilly
Karin Wolok, DataPhilly
Casey Greene, University of Pennsylvania

Precision oncology requires that we functionally categorize cancers into treatment-relevant subtypes. The predominant approach—characterizing tumors based solely on actionable mutations—struggles to detect complex changes in gene or pathway function. Alternatively, genome-wide expression profiles provide a comprehensive reflection of aberrant cellular states resulting from mutation events. Therefore, we embarked on Project Cognoma to translate between gene expression and mutation in cancer.

Cognoma is an open-source/citizen-science philanthropy being developed as a collaboration between the Greene Lab at Penn and the DataPhilly and Code for Philly meetups. This arrangement leverages the collective full-stack expertise of our diverse contributor base. To date, hundreds of individuals have attended Cognoma meetups, and more than fifty have gotten involved on GitHub (https://github.com/cognoma/cognoma). Our priorities are "everyone learns something new" and "putting machine learning in the hands of cancer biologists."

Our product is cognoma.org, a webapp that makes it easy to build mutation status classifiers from gene expression on 7,306 TCGA samples representing 33 cancer types. The publicly available dataset contains RNA-seq gene expression for 20,530 genes, non-silent mutation calls for 21,940 genes, and sample attributes such as the patient's disease, age, sex, and survival. Cognoma enables a cancer biologist to assign each sample a mutation status based on one or more selected genes. Next, a disciplined classifier is trained using gene expression and sample attributes as features. As output, the user receives the importance of each feature, offering insight into the molecular effects of the chosen mutation, as well as mutation scores for each sample, which can potentially identify hidden responders to targeted pharmacotherapies.
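As an illustrative sketch only (this is not the Cognoma codebase), the classifier-training step can be mimicked in a few lines of NumPy: synthetic expression data stand in for the TCGA matrix, and an L2-regularized logistic regression yields both per-feature importances and per-sample mutation scores. All sizes, rates, and the "signature gene" structure below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the expression matrix: 200 samples x 50 genes,
# with "mutated" samples shifted in the first 5 genes (hypothetical effect).
n, p = 200, 50
y = rng.integers(0, 2, n)              # mutation status label per sample
X = rng.normal(size=(n, p))
X[y == 1, :5] += 2.0                   # expression signature of the mutation

# L2-regularized logistic regression fit by gradient descent.
w, b, lam, lr = np.zeros(p), 0.0, 1e-2, 0.1
for _ in range(500):
    prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= lr * (X.T @ (prob - y) / n + lam * w)
    b -= lr * np.mean(prob - y)

# Feature importances (|weight|) point back at the signature genes, and
# per-sample scores flag likely mutation carriers.
importance = np.abs(w)
scores = 1.0 / (1.0 + np.exp(-(X @ w + b)))
accuracy = np.mean((scores > 0.5) == y)
```

On this easily separable toy data the trained weights concentrate on the five signature genes, mirroring how feature importances would highlight the molecular effects of a chosen mutation.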

Functionally prioritizing candidate genes from genome-wide association studies

Presenting Author: Kelsey Anderson, University of Colorado School of Medicine

Sonia Leach, National Jewish Health

Genome-wide association studies (GWAS) have become the main approach for studying the genetic architecture of common diseases and traits. GWAS correlate variants at genomic loci with the trait under study. Recovery of the important genes from these loci, however, is not always straightforward. Recent evidence suggests the majority of associations found in GWAS do not change the protein-coding region of genes, but instead affect the regulation of gene transcription. Since regulatory regions like enhancers can be hundreds of kilobases away from their target gene's promoter, a locus from a GWAS may reasonably contain dozens of plausible candidate genes. Methods to computationally select or prioritize these candidates can help save researcher time and/or verify decisions. They can also suggest the underlying biology and inform mechanistic hypotheses. Here we propose a method for functionally prioritizing the candidate genes from GWAS data. We use orthogonal evidence from protein-protein interaction (PPI) networks to score each candidate, under the assumption that the true causal proteins will be functionally related. Unlike other prioritization approaches that search for dense modules in the protein network, we use a Regularized Laplacian graph kernel to measure similarity between proteins in the network. Candidates score highly if they are strongly associated with other candidates, all of whom are similar to each other according to the graph kernel. The method is evaluated against a number of existing network-based prioritization approaches on several complex disorders and traits. In nearly all cases, our method outperforms the competition.
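As a toy illustration of the scoring idea (not the authors' implementation), the Regularized Laplacian kernel K = (I + αL)^(-1) can be computed directly on a small graph, after which a candidate's score is its summed kernel similarity to the other candidates. The network, the candidate set, and the value of α below are all hypothetical.

```python
import numpy as np

# Toy PPI network: nodes 0-3 form a dense module; node 4 hangs off node 5.
A = np.zeros((6, 6))
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (4, 5), (3, 5)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Regularized Laplacian kernel: K = (I + alpha * L)^(-1), with L = D - A.
L = np.diag(A.sum(axis=1)) - A
alpha = 0.5
K = np.linalg.inv(np.eye(6) + alpha * L)

# Score each GWAS candidate by its kernel similarity to the other candidates.
candidates = [0, 3, 4]
def score(c):
    return sum(K[c, o] for o in candidates if o != c)

scores = {c: score(c) for c in candidates}
```

Candidates embedded in the shared module (0 and 3) score higher than the peripheral candidate (4), which is the intended behavior of the prioritization.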

Deriving Population-Scale Therapeutic Trajectories to Enable Precision Pharmacology

Presenting Author: Kipp Johnson, Icahn School of Medicine at Mount Sinai

Benjamin Glicksberg, Icahn School of Medicine at Mount Sinai
Khader Shameer, Icahn School of Medicine at Mount Sinai
Joel Dudley, Icahn School of Medicine at Mount Sinai

Introduction: Treatment pathways provide standard guidelines for treating patients' primary diseases. However, patients present with comorbidities, experience side effects, and often adhere poorly to treatment. A precision prescription data analytics platform may help identify the factors that drive better therapeutic outcomes and fewer side effects.

Methods and Results: The Mount Sinai EMR contains over 18.5 million prescriptions of 1,510 unique medications. Of the entire hospital population used in this study, 803,157 patients (38.2%) had at least one prescription (mean 23.25 ± 87.21 prescriptions per patient). Polypharmacy prevalence (co-administration of 4+ prescriptions) increased in an age-dependent manner, from 4% in those 0-10 years old to 62.8% in those >80. 95,373 drug pairs were enriched for co-administration (exact test, Q<0.01). 23,656 drug-pair sequences (drug 1 followed by drug 2) were detected (binomial test, Q<0.01), including the stimulants modafinil to armodafinil (OR=185), antiplatelet therapies aspirin to ticagrelor (OR=139), diabetes drugs liraglutide to canagliflozin (OR=79), antipsychotics olanzapine to haloperidol (OR=63), and the drug-antidote pair naloxone and hydromorphone (OR=22). From these drug-pair trajectories we assembled a directed network with 838 nodes and 23,656 edges (diameter=13). Greedy clustering partitioned the network into 7 subgraphs. Network hubs were detected and scored with Kleinberg's method (the principal eigenvector of A*A^T, where A is the adjacency matrix). Top hub drugs were lisinopril, amlodipine, aspirin, fluticasone/salmeterol, hydrochlorothiazide, simvastatin, ergocalciferol, albuterol, furosemide, and omeprazole.
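The hub scoring can be illustrated on a tiny directed trajectory network (hypothetical drugs, not the Mount Sinai data): Kleinberg's hub score is the principal eigenvector of A·Aᵀ.

```python
import numpy as np

# Tiny directed drug-trajectory network: node 0 (a "hub drug") is followed
# by many others; nodes 3 and 4 each point to a single successor.
A = np.zeros((6, 6))
for j in (1, 2, 3, 4, 5):
    A[0, j] = 1.0                      # drug 0 precedes drugs 1-5
A[3, 1] = A[4, 2] = 1.0

# Kleinberg's hub score: principal eigenvector of A @ A.T
# (the authority score would instead use A.T @ A).
vals, vecs = np.linalg.eigh(A @ A.T)
hub = np.abs(vecs[:, np.argmax(vals)])
top_hub = int(np.argmax(hub))
```

Because A·Aᵀ is symmetric, `eigh` suffices; node 0, which precedes the most drugs, receives the top hub score.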

Conclusion: Systematic mining of prescription data could help to uncover relationships between therapies and outcomes and aid in the implementation of precision prescription workflows.

Comparison of Relief-F Nucleotide Differences For GWAS Data With Application to Bipolar Disorder

Presenting Author: Marziyeh Arabnejad Khanouki, University of Tulsa

Brett McKinney, University of Tulsa
Bill White, University of Tulsa

Genetic studies of bipolar disorder have found a strong overall inheritance pattern but have not identified specific genes with individually strong effects. Thus, there are likely many genes that each modestly increase susceptibility to bipolar disorder, and additional algorithms are needed to help identify the patterns of genes and other factors that act together to produce the illness or increase risk.
In this work, a feature selection algorithm called ReliefF is used to rank SNPs for bipolar disorder cases and normal controls in two published genome-wide association studies, from the National Institute of Mental Health and the Wellcome Trust Case-Control Consortium. The ranking is done with our software "ReliefSeq". There are two places in the ReliefSeq implementation of ReliefF where distance is used: ReliefF feature weighting and computing the nearest neighbors. The distance in ReliefSeq is computed with three metrics: genotype mismatch (gm), allele mismatch (am), and transition/transversion (Ti/Tv). In total, nine combinations of these metrics are used in the analysis of 5,000 SNPs.
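The three per-SNP metrics can be sketched as follows. These are plausible textbook definitions for illustration only; ReliefSeq's exact formulas may differ, and the 0.5 transition down-weighting is an assumption. Genotypes are coded as minor-allele counts (0, 1, 2).

```python
# Illustrative SNP distance metrics in the spirit of ReliefSeq's options.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def genotype_mismatch(g1, g2):
    """gm: 0 if genotypes agree, 1 otherwise."""
    return 0.0 if g1 == g2 else 1.0

def allele_mismatch(g1, g2):
    """am: fraction of mismatched alleles (0, 0.5, or 1)."""
    return abs(g1 - g2) / 2.0

def titv_weighted(g1, g2, ref, alt):
    """Ti/Tv: allele mismatch down-weighted for transitions (assumed 0.5)."""
    weight = 0.5 if frozenset((ref, alt)) in TRANSITIONS else 1.0
    return weight * allele_mismatch(g1, g2)

def distance(s1, s2, metric):
    """Total distance between two genotype vectors under one metric."""
    return sum(metric(a, b) for a, b in zip(s1, s2))
```

Pairing one metric for feature weighting with one for neighbor finding yields the nine combinations described above.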
The analysis yielded several SNPs that may be involved in the pathophysiology of bipolar disorder. After identifying the relevant genes and pathways for the high-ranking SNPs, we observed that most metric combinations enriched the Neuronal System and Axon Guidance pathways in both data sets. Combinations using the Ti/Tv metric performed better than the others in enriching the Neuronal System pathways in both data sets, while most combinations using gm were less successful in enriching the Axon Guidance pathway.

ModEvo: A Web-Based Tool for Modeling Evolutionary Dynamics

Presenting Author: Filip Jagodzinski, Western Washington University

Rainier Harvey, Western Washington University
Jesse Sliter, Western Washington University
Elizabeth Brooks, Western Washington University
Ali Scoville, Central Washington University

Quantitative genetics is concerned with developing computational models to predict the evolution of traits in response to selection. Most models for analyzing the evolution of multiple traits employ a constant genetic variance-covariance matrix (G-matrix). However, non-linear interactions between the developmental factors underlying the production of traits can drastically affect how those traits co-vary.

We have developed a code base, ModEvo, to assist in testing hypotheses about the evolutionary dynamics among multiple phenotypic traits affected by non-linear developmental interactions. Our software implements and extends a novel mathematical framework developed by Sean Rice that synthesizes concepts central to evolutionary developmental biology and quantitative genetics.

We are developing a Graphical User Interface (GUI) and the accompanying back-end infrastructure to permit biologists to interact with ModEvo via a publicly available web server. Users specify input parameters for the quantitative genetics models and invoke the back-end modeling software with a single button click. The evolutionary dynamics output by ModEvo are displayed both graphically and numerically. The infrastructure uses Google Go for the back-end server and Angular as the front-end model-view-controller framework. Our web tool is easy enough for a non-specialist to use, but also allows an experienced user to specify model parameters for a more detailed analysis.

Predicting Neural Fluctuations in the Primary Visual Cortex

Presenting Author: William Kindel, University of Colorado School of Medicine

The images we perceive are processed by our brain by evoking neural activity. Understanding how these images are turned into patterns of neural activity is an outstanding question in neuroscience. The solution will enable new therapies, such as synthetic vision for those with damaged eyes, because technology already exists to excite arrays of neurons; the difficulty is knowing precisely which ones to excite. Moreover, cracking this neural code is complicated because the brain responds uniquely to repeated presentations of the same image: the number of times an individual neuron fires within a time window fluctuates greatly.

In this presentation, I focus on understanding and predicting these neuronal fluctuations as part of a translator that predicts the neural activity in the primary visual cortex (V1) in response to seeing any image. Fortunately, the seemingly random neuronal fluctuations contain many correlations, between nearby neurons and over short time windows. Thus, across many pairs of correlated neurons, knowing the history of one neuron's activity can improve predictions of another neuron's activity. To synthesize all of this information into my fluctuation predictor, I utilize artificial neural networks, which can, in principle, find any relationship between input and output variables. Using data from nonhuman primates, I build and then benchmark the predictor by bounding how much of the neuronal noise I can predict in advance. In doing so, I shed light on the information stored in the neuronal fluctuations.

De novo protein structure prediction by big data and deep learning

Presenting Author: Sheng Wang, Toyota Technological Institute at Chicago

Jinbo Xu, Toyota Technological Institute at Chicago

Recently, ab initio protein folding using predicted contacts as restraints has made some progress, but it requires accurate contact prediction, which existing methods can achieve only on some large protein families. To deal with small protein families, we employ deep learning, a powerful technique from computer science that can learn complex patterns from large datasets and has revolutionized object recognition, speech recognition, and the game of Go. Our deep learning model for contact prediction is formed by two deep residual neural networks. The first learns the relationship between contacts and sequential features from protein databases, while the second models contact occurrence patterns and their relationship with pairwise features such as contact potential, residue co-evolution strength, and the output of the first network. Experimental results suggest that our deep learning method greatly improves contact prediction and contact-assisted folding. Tested on 579 proteins dissimilar to the training proteins, the average top-L (L is sequence length) long-range prediction accuracy of our method, the representative evolutionary coupling method CCMpred, and the CASP11 winner MetaPSICOV is 0.47, 0.21, and 0.30, respectively; their average top-L/10 long-range accuracy is 0.77, 0.47, and 0.59, respectively. Using our predicted contacts we can correctly fold 203 test proteins, while MetaPSICOV and CCMpred can do so for only 79 and 62 proteins, respectively. In three weeks of blind testing with the weekly benchmark CAMEO (http://www.cameo3d.org/), our method successfully folded three hard targets with a new fold and only 1.5L-2.5L sequence homologs, while all template-based methods failed.
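Schematically, and much simplified relative to the actual networks, the residual building block used by such pairwise contact predictors applies two convolutions to an L×L feature map and adds the input back (y = x + F(x)). The shapes, channel counts, and random weights in this NumPy sketch are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv3x3(x, w):
    """'Same'-padded 3x3 convolution over an (L, L, C_in) feature map."""
    L, _, _ = x.shape
    c_out = w.shape[3]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((L, L, c_out))
    for i in range(L):
        for j in range(L):
            patch = xp[i:i + 3, j:j + 3, :]      # 3x3xC_in receptive field
            out[i, j] = np.tensordot(patch, w, axes=3)
    return out

def residual_block(x, w1, w2):
    """y = x + F(x): two 3x3 convs with a ReLU, plus the identity shortcut."""
    h = np.maximum(conv3x3(x, w1), 0.0)
    return x + conv3x3(h, w2)

# Pairwise features for a toy protein of length L=8 with C=4 channels
# (e.g. co-evolution strength, contact potential; values here are random).
L, C = 8, 4
x = rng.normal(size=(L, L, C))
w1 = rng.normal(size=(3, 3, C, C)) * 0.1
w2 = rng.normal(size=(3, 3, C, C)) * 0.1
y = residual_block(x, w1, w2)
```

The identity shortcut is what lets such networks be stacked very deep without vanishing gradients.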

Identifying the mechanism for the metastatic spread of breast cancer through integration of gene expression, whole genome sequencing and functional screens.

Presenting Author: Eran Andrechek, Michigan State University

Breast cancer mortality is usually caused by metastasis to distant sites. Using genomic signatures to predict cell signaling pathway activation has allowed us to develop hypotheses about key signaling pathways involved in the metastatic progression of breast cancer. To test the hypothesis that the E2F transcription factors are involved in metastasis, we generated a mouse model of breast cancer lacking E2F1 or E2F2. Consistent with our hypothesis, these mice developed breast cancers lacking metastasis. The E2F family of transcription factors is classically known to regulate the G1-to-S-phase transition in the cell cycle; however, other functions have emerged. To address the function of the E2Fs in metastasis, gene expression in tumors from wild-type and E2F knockout backgrounds was analyzed. This was integrated with whole genome sequence data from matched tumor samples. Potential genetic mechanisms identified through this approach were validated for human relevance using TCGA data. Patient outcomes were screened for these genes through the application of a predictive gene expression signature. Further, the Achilles project and a drug screening study in patient-derived xenograft tumors were two functional screens that were also integrated with this work. The outcome of the integrated study was the identification of an amplification event in breast cancer that correlates with metastasis. Genetic ablation of genes in this amplicon revealed specific roles in metastatic migration. Together, this work demonstrates the utility of integrating multiple data platforms to address key biological problems.

Allelic Maps of Cancer

Presenting Author: Anelia Horvath, George Washington University

Paula Restrepo, GWU
Muzi Li, Georgetown University
Nawaf Alomran, Georgetown University
Piotr Słowiński, University of Exeter
Mercedeh Movassagh, GWU
Sonali Bahl, GWU
Wesley Waterhouse, GWU
Christian Miller, GWU
Chris Trenkov, GWU
Julian Manchev, GWU
Tatiyana Apanasovich, GWU
Nathan Edwards, Georgetown University
Krasimira Tsaneva-Atanasova, University of Exeter

The relationships between genome and transcriptome dosage have been challenging to study. Allele-specific signals, especially when integrated between DNA and RNA, allow tracking of chromosome-of-origin transcripts and may provide insights into the transcriptional dynamics of the aneuploid cell. We present a novel approach for building integrated RNA-DNA maps that depict allelic asymmetries, including those corresponding to aneuploidy, and enable correlation to expression features both at nucleotide resolution and continuously across genes and chromosomes. Using RNA2DNAlign, software recently developed in our lab, we counted the reference and variant sequencing reads at every variant position in all matching datasets and computed the variant allele fraction, VAF = n_var/(n_ref + n_var), where, in a pure diploid sample, a VAF_DNA of 0.5 corresponds to a true allelic ratio of 1:1 at heterozygous sites. To build the allelic maps, for all heterozygous sites in the normal DNA we plotted the VAF along the chromosomes of the matching normal exome, normal transcriptome, tumor exome, and tumor transcriptome. To define regions of aneuploidy, we adopted a model based on the widely accepted notion that aneuploidy produces a bimodal VAF distribution. We simulated ideal distributions corresponding to mono-, tri-, tetra-, and penta-ploid status, and tested the tumor VAF_DNA distribution for the highest similarity using the Earth Mover's Distance (EMD). We demonstrate the application of this approach using sequencing data from human tumor tissues, cell lines, and single-cell experiments.
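The ploidy-matching step can be sketched in a few lines: compute VAFs, histogram them, and pick the idealized ploidy distribution with the smallest 1D EMD (for same-binned histograms, the EMD reduces to the summed CDF difference). The counts and the idealized "pure sample" distributions below are hypothetical, not RNA2DNAlign output.

```python
import numpy as np

def vaf(n_ref, n_var):
    """Variant allele fraction: VAF = n_var / (n_ref + n_var)."""
    return n_var / (n_ref + n_var)

def emd_1d(p, q):
    """Earth Mover's Distance between two histograms on the same bins."""
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) / len(p)

bins = np.linspace(0, 1, 21)

def vaf_histogram(vafs):
    h, _ = np.histogram(vafs, bins=bins)
    return h / h.sum()

# Idealized VAF distributions at heterozygous sites in pure samples:
# diploid -> unimodal at 0.5; triploid -> bimodal at 1/3 and 2/3.
diploid = vaf_histogram(np.full(1000, 0.5))
triploid = vaf_histogram(np.r_[np.full(500, 1 / 3), np.full(500, 2 / 3)])

# A "tumor" region whose allele fractions sit near 1/3 and 2/3 should
# match the triploid model better than the diploid one.
tumor = vaf_histogram(np.r_[np.full(480, 0.34), np.full(520, 0.66)])
best = min([("diploid", emd_1d(tumor, diploid)),
            ("triploid", emd_1d(tumor, triploid))], key=lambda t: t[1])[0]
```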

Identifying non-specific effects of small molecule treatment through GSEA meta-analysis

Presenting Author: Rani Powers, University of Colorado Anschutz Medical Campus

Andrew Goodspeed, University of Colorado Anschutz Medical Campus
James Costello, University of Colorado Anschutz Medical Campus

Despite advancements in therapeutic strategies such as antibodies and gene therapy, small molecules remain the gold standard of treatment for numerous diseases, including cancer. Small molecules are low molecular weight compounds that rapidly diffuse across cell membranes to reach their molecular target, which is often a protein or nucleic acid. For example, many small molecule therapies inhibit the activity of a specific kinase. When investigating the effect of a small molecule on cell state or disease, researchers often compare the genome-wide mRNA levels of drug-treated cells and vehicle-treated control cells. The output of this experiment is a list of differentially expressed genes, which either increase or decrease in expression following drug treatment. This list can then be analyzed with gene set enrichment analysis (GSEA), an algorithm which performs hypergeometric tests with curated gene sets to determine which biological processes are more or less active in the drug-treated cells.
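Following the abstract's description of the enrichment step as a hypergeometric test against curated gene sets, the tail probability P(X ≥ k) can be computed exactly with stdlib combinatorics. The gene counts below are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): k hits among n differentially expressed genes, for a
    gene set of size K in a universe of N genes (overrepresentation test)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# Hypothetical numbers: 100 DE genes after drug treatment, 15 of which fall
# in a 200-gene "stress response" set, out of a 20,000-gene universe.
p = hypergeom_enrichment_p(k=15, n=100, K=200, N=20000)
```

With an expected overlap of only one gene, observing fifteen is overwhelmingly significant, which is exactly the kind of signal the meta-analysis checks for recurring, non-specific pathways.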

We hypothesized that even a highly-specific small molecule drug may result in non-specific effects in the cell, such as an up-regulation of generic stress response pathways. These non-specific pathways may appear as significant in the GSEA output, potentially overshadowing crucial biological processes specific to the drug under investigation. To address this problem, we aggregated several hundred gene expression experiments where human tissues or primary cells were treated with a small molecule drug. These experiments were annotated before analysis with GSEA. Our results identified pathways that are overrepresented in small molecule drug screens, providing valuable experimental and biological insight into therapeutic drug development.

Insights into Bathyarchaeota Ecology and Co-occurrence Patterns as Revealed by Public Metagenome Sequencing Data

Presenting Author: David Banks-Richardson, University of Colorado-Denver

Christopher Miller, University of Colorado-Denver
Adrienne Narrowe, University of Colorado-Denver

Members of the archaeal phylum Bathyarchaeota are a major component of aquatic sediment microbial communities. To date, a comprehensive inter-domain assessment of the co-occurrence patterns between this phylum and other organisms has not been done, and surveys of Bathyarchaeota habitat preferences have relied on amplicon-based 16S rRNA studies of limited ecosystems. Our in silico analyses suggest that primers commonly used in such 16S rRNA amplicon studies may be obscuring large portions of Bathyarchaeota phylogeny. Shotgun metagenomic sequencing has the potential to shed light on Bathyarchaeota habitat preferences, especially for the portions of the clade that PCR-primer bias may be hiding, but shotgun assembly is often incomplete or lacks the context of existing 16S phylogeny. Here, we employ a targeted gene assembly approach (EMIRGE) to reconstruct 16S rRNA sequences from publicly available shotgun metagenomes representing several environmental and host-associated habitats. We quantify the degree to which PCR-primer bias obscures the diversity of the Bathyarchaeota, build a cross-domain co-occurrence network between members of the Bathyarchaeota and other microorganisms, and identify association patterns between environmental variables and the Bathyarchaeota. Preliminary results suggest that members of this phylum may co-occur with members of the bacterial phyla Proteobacteria, Planctomycetes, and OP1, and that sub-clades within this group respond differentially to depth gradients in estuarine sediments. This study informs future work seeking to characterize the roles these broadly distributed archaea play in microbial communities across the globe.

The SNPPhenA Corpus: An annotated research abstract corpus for extracting ranked association of single-nucleotide polymorphisms and phenotypes

Presenting Author: Hamidreza Chitsaz, Colorado State University

Behrouz Bokharaeian, Complutense University of Madrid
Alberto Diaz, Complutense University of Madrid
Ramyar Chavoshinejad, Royan Institute for Reproductive Biomedicine

Single Nucleotide Polymorphisms (SNPs) are the most comprehensively studied type of genetic variation that influences a number of diseases and phenotypes. Recently, some corpora and methods have been developed for extracting SNPs, diseases, and their associations from scientific text. However, there is no available method or corpus for extracting those SNP-disease associations that have been annotated with linguistics-based negation, modality markers, neutral candidates, and the level of confidence of the association. In this research, we present the steps taken to produce the SNPPhenA corpus. These include automatic Named Entity Recognition (NER) followed by manual annotation of SNP and phenotype names, and annotation of the SNP-phenotype associations, their level of confidence, and modality markers. Moreover, the corpus has been annotated with negation scopes and cues, as well as neutral candidates, which play an important role in dealing with negation and modality in relation extraction tasks. Agreement between annotators was measured by Cohen's Kappa coefficient, and the resulting scores showed the reliability of the corpus. The Kappa score was 0.86 for annotating the associations and 0.80 for annotating the degree of confidence of the associations. Additionally, basic statistics for extracting ranked SNP-phenotype associations are presented here with regard to the annotated features of the corpus, alongside the results of our first experiments. The guidelines and the corpus are available at http://nil.fdi.ucm.es/?q=node/639. Estimating the confidence of SNP-phenotype associations could help determine phenotypic plasticity and the importance of environmental factors. Moreover, our first experiments with the corpus show that linguistic-based confidence alongside other non-linguistic features can
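For reference, the chance-corrected agreement statistic reported above (Cohen's kappa) can be computed with a short function; the annotation labels in this example are hypothetical, not drawn from the SNPPhenA corpus.

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(ann1)
    po = sum(a == b for a, b in zip(ann1, ann2)) / n   # observed agreement
    c1, c2 = Counter(ann1), Counter(ann2)
    labels = set(ann1) | set(ann2)
    pe = sum(c1[l] * c2[l] for l in labels) / n ** 2   # chance agreement
    return (po - pe) / (1 - pe)

# Toy confidence-of-association annotations from two annotators.
a1 = ["strong", "weak", "strong", "neutral", "weak", "strong"]
a2 = ["strong", "weak", "strong", "weak", "weak", "strong"]
kappa = cohens_kappa(a1, a2)
```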

Toward a Metric Learning Model for Protein Fold Recognition Using a Novel Feature Extraction Technique Based on the Mixture of Evolutionary and Secondary Structural Information

Presenting Author: Pooya Zakeri, 1) KU Leuven, 2) iMinds

Forough Amini, Institute for Research in Fundamental Sciences (IPM)
Mehdi Sadeghi, 1) National Institute of Genetic Engineering and Biotechnology, 2) Institute for Research in Fundamental Sciences (IPM)
Yves Moreau, 1) KU Leuven, 2) iMinds

It has been demonstrated that integrating evolutionary and secondary structural information can help elucidate the relationship between primary and tertiary structure in proteins. In this work, we propose a protein feature extraction technique based on blending evolutionary and secondary structural information. A protein feature vector is created by summing each column of the position-specific scoring matrix over the positions sharing the same predicted secondary structure element and dividing by the length of the protein domain.
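This feature extraction can be sketched directly: for each secondary structure state (assumed here to be the usual H/E/C three-state alphabet), sum the PSSM rows at positions predicted to be in that state, concatenate, and normalize by length. The toy PSSM below is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def ss_pssm_features(pssm, ss):
    """Sum each PSSM column over positions sharing a predicted secondary
    structure state (H/E/C), then divide by the domain length,
    yielding a 3 x 20 = 60-dimensional feature vector."""
    L = len(ss)
    states = "HEC"
    feats = np.zeros((len(states), pssm.shape[1]))
    for s, state in enumerate(states):
        mask = np.array([c == state for c in ss])
        if mask.any():
            feats[s] = pssm[mask].sum(axis=0)
    return feats.ravel() / L

# Toy domain: length-10 PSSM over 20 amino acids plus a predicted SS string.
pssm = rng.normal(size=(10, 20))
ss = "HHHHEEECCC"
v = ss_pssm_features(pssm, ss)
```

The resulting fixed-length vector is what the downstream kernel-based kNN consumes, regardless of domain length.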
A kernel-based kNN is then employed to predict protein folds. However, the performance of kNN-based fold predictors depends crucially on the metric used to measure distances between protein sequences. To overcome this limitation, we develop, for the first time, a protein fold predictor based on the large margin nearest neighbor algorithm, which learns a Mahalanobis distance metric for the kNN algorithm in a supervised fashion to improve its performance.
To evaluate our models in a more realistic task setting, we developed a time-stamped benchmark based on the SCOP database. Our models are trained on proteins of known fold discovered before a certain time, using protein features released prior to that time. We then assess our models on the prediction of proteins of known fold reported afterwards. Experimental results on this prospective benchmark, which covers about two hundred folds, demonstrate that a model based on our proposed feature extraction technique can effectively improve on the accuracy of state-of-the-art protein fold predictors such as TAXFOLD and GeoFold.

Development of a diagnostic to profile eukaryotic microbes of the human microbiome

Presenting Author: Ana Popovic, Hospital for Sick Children, University of Toronto

John Parkinson, Hospital for Sick Children, University of Toronto
Michael Grigg, National Institutes of Health

Human microbiome studies have implicated the composition of gut bacteria in immune system function, obesity, drug metabolism, and even human behaviour. While much has been learned about the contribution of bacteria to human health and disease, few studies have addressed the role of the eukaryotic members of the microbiome. This represents a considerable gap in knowledge, as single-celled eukaryotes such as Giardia, Cryptosporidium and Entamoeba infect hundreds of millions of people worldwide and are responsible for a significant burden of gastrointestinal illness. In addition to pathogenic eukaryotes, several studies have identified particular species of Blastocystis and Entamoeba as residents of the healthy gut, suggesting that eukaryotic microbes play a larger role in human health than previously appreciated. A key challenge in establishing the contribution of the eukaryotic microbiome to health and disease is the lack of accurate diagnostic technology. Here, we will present our efforts to develop a new multi-DNA-biomarker technology, based on several hypervariable regions in the small and large ribosomal subunit genes, to accurately profile eukaryotic microbes in the human gut.

G4 quadruplexes in and near regulatory elements of maize genes predict tissue type and altered transcriptional response to abiotic stresses

Presenting Author: Mingze He, Iowa State University

Angélica Sanclemente, University of Florida
Carson Andorf, USDA
Hank W. Bass, Florida State University
Harkamal Walia, University of Nebraska-Lincoln
Justin W. Walley, Iowa State University
Karen Koch, University of Florida
Peng Liu, Iowa State University
Carolyn J. Lawrence-Dill, Iowa State University

In maize shoot tissues, genes with G4 quadruplexes in or near regulatory regions respond strongly to diverse stress conditions, including submergence, cold, heat, UV, and salt stress. GO enrichment studies indicate that differentially expressed G4-containing genes are likely to be involved in developmental processes, suggesting that altered growth rates may be a specific component of the stress response. To further investigate the function of these G4 genes, we carried out transcriptomic and proteomic analyses across 55 tissues and developmental stages under non-stress conditions. We found that G4 motifs can be used as a marker to predict transcription rate and specific tissue type in normal tissues. In addition, co-expression network analysis between the maize atlas and stressed tissues revealed that G4 motifs are strongly associated with transcription factor activation in response to stress. Our results provide novel evidence for the association of G4 with emergent energy status in maize. Our findings suggest a new component of the maize stress response mechanism.

Modeling heterogeneous cell populations using Boolean networks

Presenting Author: Brian Ross, University of Colorado

James Costello, University of Colorado

Cellular processes can be simulated using Monte Carlo (random sampling) methods, but these have difficulty capturing rare outcomes, particularly when the state space is huge. Yet in many cases (such as cancer) these infrequent outcomes are the ones with the most impact. Here we present a Boolean network method for modeling mixed cell populations using a single simulation, which captures these very rare subpopulations. Our method works by treating the dynamics as a system of linear equations that allows superposition of different cell populations, in a basis rotated from the state space so that the equations tend to close with relatively few variables. For cases where the variable space is still too large, we show how to efficiently remove degeneracies in our linear system as it is being built, thereby capturing the later-time evolution with a reduced system of equations. Our method generalizes to probabilistic Boolean networks and works for both discrete and continuous time evolution.
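The core advantage of evolving a population distribution rather than sampling cells can be seen in a brute-force sketch. Note that this enumerates the full state space, which the rotated-basis method described above specifically avoids for large networks; the network rules and population weights here are hypothetical.

```python
import numpy as np
from itertools import product

# Tiny 3-gene Boolean network with made-up update rules:
# A' = A, B' = A and not C, C' = B or C
def step(state):
    a, b, c = state
    return (a, int(a and not c), int(b or c))

# Evolve an entire mixed population at once by propagating a probability
# distribution over all 2^3 states: a subpopulation at weight 1e-9 keeps
# its exact weight, where Monte Carlo sampling would almost never see it.
states = list(product((0, 1), repeat=3))
index = {s: i for i, s in enumerate(states)}
dist = np.zeros(8)
dist[index[(1, 0, 0)]] = 1.0 - 1e-9     # dominant clone
dist[index[(0, 1, 0)]] = 1e-9           # very rare subpopulation

for _ in range(5):
    new = np.zeros(8)
    for s, weight in zip(states, dist):
        new[index[step(s)]] += weight
    dist = new
```

Both clones reach their attractors ((1,0,1) and (0,0,1) respectively), and the 10^-9 subpopulation survives with its exact weight intact.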

We evaluate our method using a >50-gene network modeling prostate cancer. Our method reproduces the results of Monte Carlo while capturing rare events that Monte Carlo cannot find. As a proof of concept, we simulate the dynamics of a mixed population spanning >10^15 different cell states with all possible combinations of loss-of-function mutations. Finally, we use these simulations to find the likely mutational trajectories of an evolving tumor in our prostate-network model. Our method can thus identify the extraordinary, as well as the typical, fates of cells.

Enhancer Reprogramming in Mammalian Genomes

Presenting Author: Mario Flores, NIH

Ivan Ovcharenko, National Institutes of Health

It has been shown that changes in regulatory regions (enhancers) have supported evolution in mammals. However, there is still a lack of knowledge about the distinct types of enhancers, their identification in additional tissues/cell types, and the mechanisms that act to modify these regulatory regions during evolution. Here we study a type of enhancer that we have named reprogrammed enhancers. Enhancer reprogramming posits that changes in the transcription factor binding sites (TFBSs) of noncoding regulatory DNA sequences can change their regulatory function. In this context, TFBS loss, gain, and reshuffling within an enhancer can change its function (spatial and/or temporal regulatory activity). We have identified reprogrammed enhancers in 11 tissues/cell types in human and mouse. We estimate that, on average, 30% of the enhancers in a gene locus have been reprogrammed over the course of evolution. Furthermore, analysis of the DNA sequence changes underlying enhancer reprogramming shows a change in TFBS composition that significantly overlaps with the TFBS composition of tissue-specific enhancers. Our observations provide evidence that reprogrammed enhancers are important contributors to the shaping of the regulatory landscape during evolution.

This research is supported by the Intramural Research Program of the NIH, National Library of Medicine

The Finite State Projection based Fisher Information Matrix for the Design of Single-Cell Experiments

Presenting Author: Zachary Fox, Colorado State University

Brian Munsky, Colorado State University

Measuring and understanding gene expression fluctuations is key to predicting and controlling gene regulation dynamics. Rapidly advancing experiments enable precise quantification of RNA and protein in single cells. However, to keep pace with expanding experimental capabilities, computational and theoretical approaches must also improve. If tightly coupled with experiments, computational analyses can extract improved insight from previous measurements and enhance the effectiveness of future experiments. The Fisher Information Matrix (FIM) is a tool that is often used to aid experiment design for engineering applications, but common FIM approaches focus on deterministic models and cannot capture the full information contained in stochastic single-cell distributions. Such distributions are known to be well captured by the chemical master equation (CME). However, the CME is frequently too difficult or impossible to solve, which precludes rigorous computation of the FIM. The finite-state projection (FSP) approach systematically reduces the CME to a finite, solvable set of ordinary differential equations. In this study, we extend the FSP to compute the FSP-FIM and estimate the expected information for potential single-cell experiments. In contrast to existing experiment design strategies, our FSP-FIM approach makes no assumptions about the underlying distributions. We demonstrate the advantage of the FSP-FIM approach on several common models of stochastic gene expression, for which previous approaches and assumptions of normal distributions are not justified. Our results allow for the computational exploration of many potential experiments, and can promote iterative and efficient integration of modeling and experimentation to understand, predict and control gene expression.
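
As a toy illustration of the FSP idea (not the authors' implementation), the sketch below truncates the stationary distribution of a one-species birth-death gene expression model, whose exact steady state is Poisson(k/γ), and estimates the Fisher information about the production rate k by a central finite difference. The function names, the steady-state shortcut (rather than integrating the CME), and the truncation size are all illustrative assumptions.

```python
import math

def fsp_steady_state(k, gamma, N):
    """Truncated steady-state distribution of a birth-death gene
    expression model (production rate k, degradation rate gamma).
    The exact steady state is Poisson(k/gamma); the FSP keeps
    states 0..N and renormalizes the truncated vector."""
    lam = k / gamma
    w = [math.exp(-lam) * lam**n / math.factorial(n) for n in range(N + 1)]
    z = sum(w)
    return [wi / z for wi in w]

def fsp_fim(k, gamma, N, dk=1e-5):
    """Fisher information about k carried by one observed cell,
    computed from the FSP distribution with a central finite
    difference for dp/dk: sum_n (dp_n/dk)^2 / p_n."""
    p = fsp_steady_state(k, gamma, N)
    p_hi = fsp_steady_state(k + dk, gamma, N)
    p_lo = fsp_steady_state(k - dk, gamma, N)
    fim = 0.0
    for pn, hi, lo in zip(p, p_hi, p_lo):
        dp = (hi - lo) / (2 * dk)
        fim += dp * dp / pn
    return fim
```

For this model the exact Fisher information about k is 1/(k·γ), which the truncated computation recovers closely once N comfortably covers the bulk of the Poisson mass.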

2-Scale KNN Classifications

Presenting Author: Destiny Anyaiwe, Oakland University

George D. Wilson, William Beaumont Hospital, Royal Oak, MI,
Timothy J. Geddes, William Beaumont Hospital, Royal Oak, MI,
Gautam B. Singh, Oakland University

Diverse algorithms and methods are needed to meet the ever-increasing need to adequately harness mass spectrometer-generated data. The unique nature and structure of mass spectra data usually require a high level of expertise and rigorous algorithms. This study's methodology combines feature selection based on direct and simple mathematical observations of variables and their inter-relationships, the jackknife technique for data re-sampling, and matrix-to-vector decomposition, and successfully classifies Alzheimer's disease patients into three disease stages: age-matched controls without any evidence of dementia, patients with mild cognitive impairment, and patients with clinical symptoms of Alzheimer's disease (AD). Classification uses a 2-scale principle of the K-nearest neighbor (KNN) algorithm on SELDI data, without corroborating clinical records. Hitherto, there exists no clinical diagnostic tool for AD; in its absence, patient cognitive abilities are clinically followed up over a period of time (possibly months) to make a diagnosis. This practice often leads to inconclusive diagnoses, and results obtained from it are not generalizable. Our model provides an immediate classification and correctly classifies test data sets with 82% confidence. It can also identify traces of positive/negative change within and across data sets with regard to the severity of the disease over time.
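
The 2-scale KNN variant itself is not specified in this abstract; the sketch below shows only the generic ingredients it builds on: plain K-nearest-neighbor voting over feature vectors, and leave-one-out (jackknife) estimation of classification accuracy. All function names and the toy data are illustrative assumptions, not the study's code.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify one feature vector by majority vote among its k
    nearest training vectors (Euclidean distance).  `train` is a
    list of (features, label) pairs."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def jackknife_accuracy(data, k=3):
    """Leave-one-out (jackknife) estimate of classification
    accuracy: each sample is classified against all the others."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)
```

On a toy two-cluster dataset (e.g. control spectra near one point, AD spectra near another) the jackknife accuracy is 1.0, since each held-out sample's nearest neighbors stay within its own cluster.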

Best practices for reproducible and robust data analysis in a bioinformatics core facility

Presenting Author: James Denvir, Marshall University

Don Primerano, Marshall University
Jun Fan, Marshall University
Swanthana Rekulapally, Marshall University

With the publication of standards for Minimum Information About a Microarray Experiment in 2001, and the subsequent establishment of global repositories for gene expression and sequencing data, the research community has made substantial achievements in making research data associated with published, peer-reviewed manuscripts available for reuse and evaluation. However, there are currently no standards for the level of detail about the analysis that should be provided in a publication. Consequently, it is rare to find publications in which the data analysis pipeline is described in sufficient detail for the analysis to be reproduced, or in some cases even critically evaluated.

We incorporated simple practices from software engineering, including version control, self-documenting code, and convention-over-configuration techniques, into the data analysis pipelines used in a small genomics and bioinformatics core facility. Adoption of these techniques both improved the ability of our facility to create reproducible pipelines and enhanced operational efficiency.

The Affinity Data Bank for biophysical analysis of regulatory sequences

Presenting Author: Todd Riley, University of Massachusetts Boston

Cory Colaneri, UMass Boston
Aadish Shah, UMass Boston
Brandon Phan, UMass Boston
Pritesh Patel, UMass Boston
Zazil Villanueva, UMass Boston
Devesh Bhimsaria, University of Wisconsin-Madison

We present the Affinity Data Bank (ADB), a suite of tools that provides biologists with novel aids to deeply investigate the sequence-specific binding properties of a transcription factor (TF) or an RNA-binding protein (RBP), and to study subtle differences in specificity between homologous nucleic acid-binding proteins. Integrated with Pfam, the PDB, and the UCSC database, the ADB also allows for simultaneous interrogation of protein-DNA and protein-RNA specificity and structure in order to find the biochemical basis for differences in specificity across protein families. The ADB also includes a biophysical genome browser for quantitative annotation of in vivo binding, using free protein concentrations to model the non-linear saturation effect that relates binding occupancy to binding affinity. Importantly, the in vivo TF and RBP protein concentrations can be inferred from transcriptome or proteome data, including RNA-seq data. The biophysical browser also integrates dbSNP and other polymorphism data in order to depict changes in affinity due to genetic polymorphisms, which can aid in finding both functional SNPs and functional binding sites. Lastly, the biophysical browser supports biophysical positional priors to allow for quantitative designation of the in vivo, locus-specific accessibility that a protein has to the DNA. With the inclusion of these occupancy-based and affinity-based positional priors, the ADB can properly model in vivo protein-DNA binding by integrating the effects of chromatin accessibility and epigenetic marks.
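
The non-linear saturation effect relating occupancy to affinity can be illustrated with the standard single-site (Langmuir) binding isotherm; whether the ADB uses exactly this functional form is an assumption here, and both function names are hypothetical.

```python
def occupancy(protein_conc, kd):
    """Equilibrium binding occupancy from the standard single-site
    isotherm: theta = [P] / ([P] + Kd).  High-affinity sites (small
    Kd) saturate at low free protein concentration, which is why
    occupancy is a non-linear function of affinity."""
    return protein_conc / (protein_conc + kd)

def snp_occupancy_change(protein_conc, kd_ref, kd_alt):
    """Change in occupancy caused by a polymorphism that shifts a
    site's dissociation constant from kd_ref to kd_alt (negative
    when the variant weakens binding)."""
    return occupancy(protein_conc, kd_alt) - occupancy(protein_conc, kd_ref)
```

For example, at [P] = Kd a site is half-occupied, and a SNP that triples Kd at that concentration drops occupancy from 0.5 to 0.25.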

Pattern-based estimation of the extent of explicit contradiction in the scientific literature

Presenting Author: Elizabeth White, University of Colorado Denver, Anschutz

K. Bretonnel Cohen, University of Colorado Denver, Anschutz
Jennifer Panzo, De La Salle University

Enormous amounts of effort are put into manually extracting findings from the scientific literature and associating them with entries in model organism databases, and as natural language processing techniques improve, it may soon be possible to accelerate that effort considerably. But what is the reliability of statements in that literature? One way to estimate it is to look for evidence of explicit contradictions in scientific journals. We used search engines and a set of patterns that can indicate a deliberate claim of contradiction of another paper’s findings to gauge the number of contradictions in a variety of scientific fields. The findings are consistent with the conclusion that there is a large amount of contradiction in the scientific literature, and that the amount of contradiction varies considerably from one sub-field to the next.
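
A minimal sketch of pattern-based contradiction detection, with hypothetical cue phrases; the paper's actual pattern set is not reproduced here.

```python
import re

# Hypothetical cue phrases signalling a deliberate claim of
# contradiction of another paper's findings.
PATTERNS = [
    r"\bin contrast to (?:the )?(?:findings|results) of\b",
    r"\bcontrary to (?:a )?previous (?:report|study|studies)\b",
    r"\bwe were unable to (?:replicate|reproduce)\b",
    r"\bcontradicts? (?:the )?(?:conclusions?|findings) of\b",
]
CUES = [re.compile(p, re.IGNORECASE) for p in PATTERNS]

def flags_contradiction(sentence):
    """Return True if any cue pattern fires on the sentence."""
    return any(c.search(sentence) for c in CUES)
```

In practice such patterns would be fed to a search engine and the hits counted per sub-field, with manual review to filter false positives (e.g. authors contradicting their own earlier work).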

Towards Highly Accurate Mapping of Protein Glycosylation Sites in the Human Proteome

Presenting Author: Chen Li, Monash University

Fuyi Li, Monash University
Jerico Revote, Monash University
Yang Zhang, Northwest A&F University
Geoffrey Webb, Monash University
Jian Li, Monash University
Jiangning Song, Monash University
Trevor Lithgow, Monash University

Glycosylation is a crucial and ubiquitous type of protein post-translational modification and plays an important role in cell-cell adhesion, ligand binding and subcellular recognition. To facilitate high-throughput prediction of protein glycosylation sites, we proposed GlycoMine, a comprehensive tool for in silico identification of C-, N-, and O-linked glycosylation sites in the human proteome. Heterogeneous sequence and functional features were derived from various sources and subjected to a further two-step feature selection to characterize a condensed subset of optimal features that contributed most to the type-specific prediction of glycosylation sites. Experimental studies showed that GlycoMine significantly improved prediction performance compared with existing prediction tools. Given that little work has been done to systematically assess the importance of structural features to glycosylation prediction, we then proposed an updated version of GlycoMine, GlycoMine_struct, for improved prediction of human N- and O-linked glycosylation sites by combining sequence and structural features in an integrated computational framework with a two-step feature-selection strategy. Experiments indicated that GlycoMine_struct outperformed currently existing predictors that incorporate both sequence and structure features, achieving AUC values of 0.941 and 0.922 for N- and O-linked glycosylation, respectively, on an independent test dataset. Both GlycoMine and GlycoMine_struct have been used to screen the human proteome to obtain high-confidence predictions of glycosylation sites. We anticipate that GlycoMine and GlycoMine_struct can be used as powerful computational approaches to expedite the discovery of glycosylation events and substrates and to facilitate hypothesis-driven experimental studies.
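
AUC values like those reported above can be computed directly from classifier scores via the Mann-Whitney formulation: the probability that a randomly chosen positive (true glycosylation site) outscores a randomly chosen negative. This generic sketch is not GlycoMine's code.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney formulation:
    the fraction of (positive, negative) score pairs in which the
    positive wins; ties count as one half.  O(n*m), fine for small
    evaluation sets."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

A perfect separator scores 1.0, a random one 0.5; an AUC of 0.941 means a true site outscores a non-site about 94% of the time.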