13th Annual Rocky Mountain Bioinformatics Conference


Max Alekseyev, PhD
Associate Professor
George Washington University
Computational Biology Institute
Ashburn, VA, USA

Title: Scaffold Assembly Based on the Analysis of Gene Orders and Genomic Repeats

Abstract: Genome sequencing technology has evolved over time, increasing the availability of sequenced genomic data. Modern sequencers can identify only short subsequences (reads) of the supplied genomic material, which then become the input to genome assembly algorithms aimed at reconstructing the complete genome. Such reconstruction is possible (but not guaranteed) only if each genomic region is covered by sufficiently many reads. Lack of comprehensive coverage and the presence of long similar subsequences (repeats) in genomes pose major obstacles for existing assembly algorithms, which therefore are often able to reliably reconstruct only long subsequences of the genome (interspersed with low-coverage regions and repeats), called scaffolds.

In the current work, we address the scaffold assembly problem, i.e., the reconstruction of a complete genomic sequence from scaffolds. We assume that the given scaffolds are accurate and long enough to allow identification of orthologous genes. The scaffolds can then be represented as ordered sequences of genes, and we pose the scaffold assembly problem as the reconstruction of the global gene order (along genome chromosomes) from the gene sub-orders defined by the scaffolds. We view such gene sub-orders as the result of both evolutionary events and technological fragmentation in the genome. Evolutionary events that change gene orders are represented by genome rearrangements (inversions/translocations/fissions/fusions), while technological fragmentation can be modeled by artificial "fissions" that break chromosomes into scaffolds. This observation inspires us to employ genome rearrangement analysis for scaffold assembly.
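As a concrete (and heavily simplified) illustration of the gene-order view, the sketch below encodes scaffolds as signed gene sequences and scores a candidate join of two scaffolds by the number of adjacencies it restores from a reference gene order, i.e., by how well it reverses an artificial "fission". This is a toy example, not the speaker's algorithm; the gene labels and the scoring rule are invented for illustration.

```python
# Toy sketch: scaffolds as signed gene orders. A candidate join of two
# scaffolds is scored by how many adjacencies from a reference gene order
# it restores. Gene names and data are invented for illustration.

def adjacencies(order):
    """Adjacencies between consecutive genes, normalized so that a gene
    order and its reversal (with flipped signs) yield the same set."""
    return {min((a, b), (-b, -a)) for a, b in zip(order, order[1:])}

def join_gain(left, right, reference_adj):
    """Reference adjacencies created by concatenating left + right."""
    new = adjacencies(left + right) - (adjacencies(left) | adjacencies(right))
    return len(new & reference_adj)

# Reference: one chromosome with genes 1..5; an artificial "fission" broke
# it into two scaffolds between genes 3 and 4.
ref = adjacencies([1, 2, 3, 4, 5])
print(join_gain([1, 2, 3], [4, 5], ref))  # -> 1: restores the 3-4 adjacency
print(join_gain([4, 5], [1, 2, 3], ref))  # -> 0: wrong join restores nothing
```

Here the join [1, 2, 3] + [4, 5] recreates the broken adjacency between genes 3 and 4, while the reversed join creates no reference adjacency, so the first would be preferred.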

Biography: Dr. Max Alekseyev is an Associate Professor of Mathematics and Computational Biology at the George Washington University. He holds an M.S. in mathematics and a Ph.D. in computer science and is a recipient of the NSF CAREER award. He was formerly an Assistant Professor of Computer Science at the University of South Carolina and served as a Scientific Director for the Laboratory for Algorithmic Biology at St. Petersburg Academic University, Russia, where he led the initial development of the genome assembler SPAdes. Dr. Alekseyev's research interests range from discrete mathematics (particularly, combinatorics and graph theory) to computational biology (particularly, comparative genomics and phylogenomics) and are focused on the development and application of new computational methods to solve old and recently emerged open biological problems.  For more information, visit: http://home.gwu.edu/~maxal/

Larry Gold, PhD
Chairman and Founder of SomaLogic
Professor, University of Colorado
Boulder, Colorado, USA

Title:  Proteomics 101

Abstract:   All bio-omics, and in fact all measurements, are impacted by signal-to-noise.  There are at least two kinds of noise: a first kind that results from the measurements themselves, and a second kind that results from the sum of neutral events in the long biological history of the phenomenon under study.  I mean neither “neutral fixation” nor “frozen accidents” – rather I mean events that have occurred and are not fixed or frozen, but merely present at the moment in time that we measure them.  These events (call them our SNPs or microRNAs or lncRNAs) mostly have no staying power whatsoever in the future of the organism – they are temporary snapshots of the churning of a genome.  The accumulation of mutations in a creature over the time since its speciation (the churning…) derives from nothing more than the intrinsic mutation rate (upon which function and selection depend, eventually), and thus many nucleic acid events/items may have no impact on anything, today or in the future. No one knows what fraction of DNAs or RNAs are meaningful today or might be selected in the future, but everyone agrees that whatever DNA or RNA one’s lab studies is meaningful.  Religion meets omics every day!

In a perfect world we would know which of the things we measure are and are not meaningful today.  We would then study only meaningful phenomena.  One strategy, common in the biological sciences, is to reduce what one studies to some model system one (thinks one) can grasp – the lac operon, for example, or phage lambda development – and then do “old style” (non-omics) science on a component in that model system one has decided is critical.  “Critical” is easy to declare if we mean merely essential (and, even worse, merely essential over the short time frame of a real experiment) – the rest is more difficult.  I believe that a key “proof” of how meaningful a phenomenon is might be how that phenomenon evolved: if the evolution was divergent (based on descent over time) the meaningfulness might be less certain (since drift during descent is limited to the intrinsic mutation rates of various creatures) than if we can see convergent evolution in species that were distinct for a very long time.  We might discuss a “meaningfulness” filter that gives high value to systems that converged on a solution, without obvious homologies in the sequences of the genes that converged.  I will choose examples of convergent evolution that are unlikely to have derived from lateral gene transfer…the examples will (probably) be the chicken egg white/bacteriophage T4 lysozymes and the fluoride ion riboswitches found in different bacterial species.

Protein species and their concentrations and functions in living creatures may be intrinsically less noisy/more meaningful than nucleic acids; one might say that based on some first principle: the energy utilized to make a protein is large compared to the energy used in synthesizing a nucleic acid.  Proteomics has come a long way in the last decade, and has almost reached a point equivalent to genomics – we will need to know what measurements are important, and which are my version of “noise.”  It is now possible to quantify accurately thousands of human proteins in many matrices (blood, tissue homogenates, urine, whatever), quickly and reasonably inexpensively.  Thus proteomic data are entering the mainstream of what computer scientists and bioinformatics experts will be able to contemplate, to analyze, and (oops) to over-interpret.  I will spend a little time showing what those improvements in proteomics have been, since they are amazing.

The emerging proteomics data are now discussed in terms similar to other forms of big data derived from nucleic acid measurements.  What’s a driver, what’s irrelevant, what’s worth our time, and what is not?  This is a high class problem: proteomics data sets are just moving into an arena where the intrinsic measurements are good (not too noisy), and the deeper questions of biology can be discussed.  We will be obliged to say precise things about the mysterious networks of proteins that govern biology, and to propose specific functional networks from all possible networks in the absence of sufficient data to do so.  Hang on to your hats…

Biography: web

Casey S. Greene, PhD
Assistant Professor
University of Pennsylvania Perelman School of Medicine
Philadelphia, PA, USA

Title:  Unsupervised discovery from large gene expression compendia with ADAGE

ROCKY 2014 slides
Click here to view slides.

Abstract: Our overarching goal is to transform how we understand complex biological systems by developing and applying computational algorithms that effectively model processes by integrating multiple types of big data from diverse experiments. We use these methods to infer the key contextual information required to interpret such data, facilitating both the computationally driven asking and answering of basic science and translational research questions. I will discuss a new approach, ADAGE, which integrates large-scale data from distinct experiments in an unsupervised manner. With ADAGE, a denoising autoencoder of gene expression is trained on a complete collection of genomic data and applied to generate hypotheses about the mechanisms underlying a molecular process from targeted perturbations. These unsupervised methods can be applied to un- and under-curated systems, making them broadly applicable in the age of diverse large-scale datasets.
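The core of ADAGE is a denoising autoencoder. As a rough, self-contained sketch of that idea (illustrative only, not the ADAGE implementation; the layer sizes, corruption rate, and random "expression" matrix are placeholders), one can corrupt expression samples, encode them through a small hidden layer, and train the network to reconstruct the clean input:

```python
import numpy as np

# Minimal denoising-autoencoder sketch (illustrative; not the ADAGE code).
# Each row of X is one sample; hidden nodes learn recurring cross-gene
# patterns because the network must reconstruct clean data from a
# corrupted copy.

rng = np.random.default_rng(0)
n_samples, n_genes, n_hidden = 200, 50, 10
X = rng.random((n_samples, n_genes))        # stand-in for normalized expression

W1 = rng.normal(0, 0.1, (n_genes, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_genes))
b2 = np.zeros(n_genes)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, corruption = 0.1, 0.2
for epoch in range(50):
    Xc = X * (rng.random(X.shape) > corruption)  # zero out ~20% of entries
    H = sigmoid(Xc @ W1 + b1)                    # encode corrupted input
    Y = sigmoid(H @ W2 + b2)                     # decode back to gene space
    # backpropagate squared error against the *clean* data
    dY = (Y - X) * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dY / n_samples
    b2 -= lr * dY.mean(axis=0)
    W1 -= lr * Xc.T @ dH / n_samples
    b1 -= lr * dH.mean(axis=0)

err = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - X) ** 2)
print(f"reconstruction MSE: {err:.4f}")
```

After training, each hidden node's column of W1 weights a set of genes that co-vary across samples; in the ADAGE setting such nodes are interpreted as candidate molecular processes.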

CV: .pdf
Green Lab: web

Tom Hraha
Research Scientist, Bioinformatics
SomaLogic, Inc.
Boulder, Colorado

Title: Large-Scale Longitudinal Biomarker Discovery using SOMAscan®: Diurnal Rhythms, Pregnancy and Tuberculosis Risk

Co-Authors: David Sterling, Urs A. Ochsner, Mary Ann de Groote, Ed Melanson, Kenneth P Wright, and Tim Bauer

The SOMAscan® assay is a high-throughput, multiplexed proteomic technique that uses modified aptamer binding reagents, termed SOMAmer® reagents, to measure >4,000 proteins simultaneously. The use of this platform for large-scale biomarker discovery in blood provides an immediate measure of an individual’s phenotype from a single sample. Moreover, longitudinal measurements from the same individuals over time can characterize physiologic processes in normal and disease states, as well as progression toward a diseased state.

After describing several unique physiological insights from longitudinal studies of human pregnancy and circadian rhythm, we will focus on recent results from a clinical study of 6,000+ adolescents aimed at identifying biomarkers that are correlates of risk for progression from latent to active Tuberculosis (TB). Although one third of the world is latently infected with TB, it is not known why some individuals progress to active infection while the majority do not. Using machine learning techniques, we have developed models capable of identifying, with high accuracy, the individuals most likely to progress from latent to active disease as much as a year prior to diagnosis. These mechanistic insights into the biology of TB progression hold promise for targeted preventative treatment of persons at higher risk, which is now accepted as vital to global eradication efforts.

Tom Hraha is a research scientist in the Bioinformatics department at SomaLogic, Inc., where his interests are in the statistical analysis of large-scale proteomic datasets, with an emphasis on longitudinal analyses and predictive modeling. He is currently a member of a multi-disciplinary team funded by the Gates Foundation to use SomaLogic’s platform to develop a non-sputum-based rapid diagnostic test for Mycobacterium tuberculosis.  A recipient of a National Science Foundation Fellowship, Tom obtained a Master of Science in Bioengineering from the University of Colorado at the Anschutz Medical Campus, where his research focused on the network dynamics of cell signaling using image and signal processing. Prior to graduate school, Tom helped start Greyledge Technologies, a company based in Vail, Colorado providing custom regenerative biologics to orthopedic operating rooms – while also skiing 75+ days a season.

Kirk E. Jordan, PhD
IBM Distinguished Engineer
Data Centric Systems
IBM T.J. Watson Research
Chief Science Officer
IBM Research UK

Co-Authors:  Chang-Sik Kim, Vipin Sachdeva, Martyn Winn

Title:  Data Centric Systems (DCS): Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics for the Life Sciences

Abstract: The world is awash in data from a variety of sources, e.g., multi-media, sensors, and next-generation sequencers.  The volume, variety, velocity, and veracity of data, especially in the life sciences, are pushing how we think about computer systems.  The problem for computing is no longer the ability to compute but the inability to move data and handle large data sets. In this talk, I will describe the work that the IBM Research Data Centric Systems team is doing to develop compute systems that handle large data sets to shorten the time to solution for problems of interest to a variety of users, including in the life sciences.  I will give an overview of our motivation, the systems currently in design, and our focus on workflows and application solutions, in which we are partnering with organizations to ensure appropriate impact, including some details on the Trinity Code Workflow that has been shaping our thinking for the last few years.

CV: web

Carlos Oliveira, PhD
Scientist, Biodesix, Inc
Boulder, CO, USA

Title: Creating molecular diagnostic tests with supervised learning using time-to-event data

Creating new molecular diagnostic tests to inform prognosis or treatment benefit using sample sets for which training class labels for supervised learning are not clearly defined is a challenging task. Tests based on time-to-event outcome endpoints fall into this category, because it is not known a priori which patients should be assigned to the better/worse prognosis or treatment benefit/no benefit groups for classifier training. This is particularly important, as time-to-event endpoints such as overall survival are normally the gold standard for assessing treatment benefit.

The Diagnostic Cortex™ data analytics platform can create molecular tests using small training sets with associated time-to-event data. The class labels are defined iteratively at the same time as the classifier is trained. The platform draws on ideas from Deep Learning and, in addition, incorporates important elements focused on dealing with challenges related to having “more features than instances” so that potential overfitting is minimized. In our experience, this method has produced tests that generalize well to independent sample sets and can provide an accurate performance estimate even when no validation set is available. The method also allows for the tuning of the classifier to meet clinically relevant performance criteria, such as avoiding confounders. Although initially developed to work with matrix-assisted laser desorption/ionization time of flight (MALDI-TOF) spectral data, the Diagnostic Cortex can also be used to analyze other kinds of data, for example mRNA expression.
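The iterative label-definition idea can be caricatured as follows (a toy sketch under strong simplifying assumptions, not the Diagnostic Cortex implementation): guess initial class labels by splitting at the median survival time, train a simple classifier, relabel samples with its predictions, and repeat until the labels stabilize. The synthetic data and nearest-centroid classifier below are invented stand-ins.

```python
import numpy as np

# Toy sketch of defining training labels iteratively from time-to-event data
# (a caricature, not the Diagnostic Cortex platform). Synthetic features and
# survival times are both driven by a latent prognosis score.

rng = np.random.default_rng(1)
n, p = 100, 5
risk = rng.normal(size=n)                          # latent prognosis
X = np.outer(risk, rng.normal(size=p)) + 0.5 * rng.normal(size=(n, p))
time = np.exp(-risk + 0.3 * rng.normal(size=n))    # high risk -> short survival

def nearest_centroid(X, labels):
    """Predict labels by distance to the two class centroids."""
    c0 = X[labels == 0].mean(axis=0)
    c1 = X[labels == 1].mean(axis=0)
    d0 = ((X - c0) ** 2).sum(axis=1)
    d1 = ((X - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

# Step 1: initial guess -- label 1 ("better prognosis") = above-median survival.
labels = (time > np.median(time)).astype(int)

# Steps 2-3: train, relabel, and repeat until the labels stop changing.
for _ in range(20):
    new = nearest_centroid(X, labels)
    if (new == labels).all():
        break
    labels = new

print(time[labels == 1].mean(), time[labels == 0].mean())
```

In the real platform the classifier, regularization, and label-update rules are far more elaborate; the point of the sketch is only that the labels and the classifier are refined together rather than fixed up front.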

Examples of classifiers in oncology developed with the Diagnostic Cortex from both serum proteomic MALDI-TOF MS and tissue mRNA expression data, including a test with clinical utility in immunotherapies, will be presented.

Biography: Dr. Oliveira earned his PhD in 2011 from the University of Aveiro in Portugal for his work on simulation tools that aid the design of electroluminescent noble-gas detectors, with applications in medical imaging and particle physics. He held a post-doctoral fellowship at Lawrence Berkeley National Laboratory, Berkeley, CA, studying the properties of alternative gas mixtures with the aim of improving the performance of gaseous Time Projection Chambers and enabling their use in neutrino and dark matter research. Dr. Oliveira joined Biodesix, Inc. in June 2014 as a Scientist and is part of the molecular diagnostics company’s new classifier development team, which is focused on clinically actionable, serum- and mass spectrometry‐based diagnostic tests for patients with cancer.

Karin Verspoor, PhD
Associate Professor
Department of Computing and Information Systems
Deputy Director, Health and Biomedical Informatics Centre
University of Melbourne, Australia

Title: Accelerating Biomedical Discovery through Large-Scale Heterogeneous Data Integration

Biomedical research has advanced rapidly in recent years, producing an unprecedented amount of data and knowledge. This data is increasingly complex, ranging from the output of next generation sequencing of DNA to gene expression data to large repositories of health-related patient data. While much of this data is structured, there has been a simultaneous explosion in unstructured data, captured in the published biomedical literature and in electronic health records. It is becoming increasingly challenging for biomedical researchers to keep up with this explosion of data, and automated strategies for processing, interpreting, and contextualising it are required. In this presentation, I will discuss the use of text mining techniques for extraction of information from the biomedical literature, and demonstrate how text-derived information can be combined with other data resources to support analysis of biological data sets. I will hint at a range of problems that might be tackled with heterogeneous data analysis incorporating text resources, and specifically discuss applications in protein function prediction and analysis of genetic variants that are promising examples of the approach.

Biography: Karin Verspoor is Associate Professor in the Department of Computing and Information Systems and Deputy Director of the Health and Biomedical Informatics Centre at the University of Melbourne. She was formerly the Scientific Director of Health and Life Sciences at the NICTA Victoria Research Laboratory, where she was a Principal Researcher and led the NICTA Biomedical Informatics team. Trained as a computational linguist, Karin’s research is primarily focused on text mining and data analytics of clinical texts and the biomedical literature to support biological discovery and clinical decision support. Karin previously held posts at the University of Colorado School of Medicine and Los Alamos National Laboratory, was a post-doctoral researcher at Macquarie University in 1997-98, and spent 5 years in start-ups during the US tech bubble around Y2K.

Lei Xie, PhD
Associate Professor
Department of Computer Science
Hunter College, the City University of New York

Title:  Precision drug rescue and drug repurposing using structural systems pharmacology

Abstract: Precision medicine is an emerging approach to disease treatment and prevention that takes into consideration individual genetic and environmental variability. However, the advance of precision medicine is hindered by a lack of mechanistic understanding of the energetics and dynamics of drug-target and genetic interactions in the context of the whole human interactome. To address this challenge, we have developed a novel structural systems pharmacology approach to elucidate the molecular basis and genetic biomarkers of drug action. Our approach combines big data analytics and mechanism-based modeling by integrating structural genomic, functional genomic, metabolomic, and interactomic data. By searching all structurally characterized human proteins and applying molecular modeling and machine learning, we are able to construct genome-scale, high-resolution drug-target interaction models. Subsequently, we link the putative off-targets to genome-scale biological networks to identify drug modulation pathways and cryptic genetic factors. As proof-of-concept studies, we have applied our structural systems pharmacology approach to drug rescue and drug repurposing for precision medicine. We have identified cryptic genetic factors that account for the side effects of Torcetrapib, a cholesterol-lowering drug that failed in a phase III clinical trial due to serious side effects. Recently, we have revealed molecular and genetic mechanisms of metformin, enabling us to repurpose metformin as a precision anti-cancer therapy. The predicted molecular targets of metformin were experimentally validated. Our results shed new light on repurposing metformin as a safe, effective, personalized therapy, and demonstrate that structural systems pharmacology is a potentially powerful tool to facilitate the development of precision medicine.

CV: web