Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

#ISMB2016

Late Breaking Research Presentations

Highlights, Late Breaking Research and Proceedings Track presentations will be presented by Theme.
Presenters' names in bold (for updates and changes, email steven@iscb.org)

Attention Conference Presenters - please review the Speaker Information Page available here.

TP003: FUNCTIONALLY PROFILING METAGENOMES AND METATRANSCRIPTOMES AT SPECIES-LEVEL RESOLUTION
Date: Sunday, July 10 10:10 am - 10:30 am
Room: Northern Hemisphere E1/E2
Topic: SYSTEMS / GENES
  • Eric Franzosa, Harvard T. H. Chan School of Public Health, United States
  • Lauren McIver, Harvard T. H. Chan School of Public Health, United States
  • Gholamali Rahnavard, Harvard T. H. Chan School of Public Health, United States
  • George Weingart, Harvard T. H. Chan School of Public Health, United States
  • Karen Schwarzberg, Northern Arizona University, United States
  • Luke Thompson, University of Colorado at Boulder, United States
  • Rob Knight, University of California San Diego, United States
  • J. Gregory Caporaso, Northern Arizona University, United States
  • Nicola Segata, University of Trento, Italy
  • Curtis Huttenhower, Harvard T. H. Chan School of Public Health, United States

Area Session Chair: Alex Bateman

Profiling microbial community function typically involves mapping millions of metagenomic or metatranscriptomic (“meta’omic”) sequencing reads against comprehensive reference sequence databases, often by translated search. In addition to being time-consuming and error-prone, this approach only provides an aggregate profile for a community, thus obscuring the contributions of individual species. To address these challenges, we designed a new tiered strategy for meta’omic functional profiling (HUMAnN2). Our method 1) rapidly identifies the species in a meta’omic sample, 2) maps sequencing reads to a sample-specific database constructed from those species’ pangenomes, and 3) only falls back to translated search for unclassified reads. In evaluations using synthetic data, HUMAnN2’s predicted functional profiles were 87% accurate at the community level (vs. 33% for pure translated search), and 79 to 91% accurate at the level of individual species. We applied HUMAnN2 to identify conserved metabolic pathways among 921 metagenomes from the Human Microbiome Project. In this task, HUMAnN2 tended to explain the majority of sample reads 10x faster than traditional search methods, thus saving 1,000s of CPU hours of compute time. Moreover, by highlighting individual species’ functional contributions, HUMAnN2 revealed new ecological patterns of functional conservation in the human microbiome (e.g. conserved metabolic pathways contributed by different species in different individuals). We expect our improvements to the performance and resolution of meta’omic functional profiling to be broadly applicable to analyses of host- and environmentally-associated microbial communities. HUMAnN2 is open-source, fully documented, and available for download now from http://huttenhower.sph.harvard.edu/humann2.
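
The tiered fallback described above can be illustrated with a toy sketch. It is not HUMAnN2's interface: the function name, the dictionary standing in for a sample-specific pangenome database, and the `translated_search` callback are all hypothetical, and the species-identification step (tier 1) is assumed to have already produced the index.

```python
from collections import Counter

def tiered_profile(reads, pangenome_index, translated_search):
    """Toy tiered functional profiling (illustrative only).

    reads: iterable of read strings.
    pangenome_index: dict mapping read -> (species, gene_family); a stand-in
        for nucleotide mapping against a sample-specific pangenome database.
    translated_search: callable(read) -> gene_family or None; a stand-in for
        the slower translated-search fallback.
    """
    profile = Counter()
    for read in reads:
        hit = pangenome_index.get(read)
        if hit is not None:
            # Tier 2: the read maps to a known species' pangenome.
            profile[hit] += 1
            continue
        # Tier 3: only unclassified reads reach the translated search.
        family = translated_search(read)
        if family is None:
            family = "UNMAPPED"
        profile[("unclassified", family)] += 1
    return profile
```

Because most reads are resolved in the dictionary lookup, the expensive callback runs only on the remainder, which is the source of the speed-up the abstract reports.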

TP004: Scalable latent-factor models applied to single-cell RNA-seq data separate biological drivers from confounding effects
Date: Sunday, July 10 10:30 am - 10:50 am
Room: Northern Hemisphere A1/A2
Topic: GENES
  • Florian Buettner, EMBL-EBI, United Kingdom
  • John C. Marioni, EMBL-EBI, United Kingdom
  • Oliver Stegle, EMBL-EBI, United Kingdom

Area Session Chair: Ioannis Xenarios

Single-cell RNA-sequencing (scRNA-seq) allows heterogeneity in gene expression levels to be studied in large populations of cells. However, such heterogeneity can arise due to both technical and biological factors, thus making decomposing sources of variation extremely difficult. Current methods to dissect this heterogeneity have critical limitations as they do not scale to large datasets comprising tens of thousands of cells and in particular do not permit joint modelling of the effects of biological factors and additional unknown and confounding sources of variation. Here we describe a computationally efficient model that uses latent factors to jointly infer both biological and confounding sources of gene expression variation. We validate the method using simulations, demonstrating both its accuracy and its ability to scale to large datasets with up to 100,000 cells. Moreover, through application of the model to the largest single-cell RNA-seq study generated to date, consisting of 49,300 retina cells, we show that our model can robustly decompose scRNA-seq datasets into interpretable components and facilitate the identification of novel sub-populations.

TP006: Integrating very large multi'omics data by hierarchical all-against-all association testing
Date: Sunday, July 10 10:30 am - 10:50 am
Room: Northern Hemisphere E1/E2
Topic: SYSTEMS
  • Gholamali Rahnavard, The Broad Institute, Harvard T.H. Chan School of Public Health, United States
  • Eric A. Franzosa, The Broad Institute, Harvard T.H. Chan School of Public Health, United States
  • Lauren J. McIver, Harvard T.H. Chan School of Public Health, United States
  • George Weingart, Harvard T.H. Chan School of Public Health, United States
  • Emma Schwager, The Broad Institute, Harvard T.H. Chan School of Public Health, United States
  • Yo Sup Moon, Harvard T.H. Chan School of Public Health, United States
  • Xochitl C. Morgan, Harvard T.H. Chan School of Public Health, United States
  • Levi Waldron, City University of New York School of Public Health, Hunter College, United States
  • Curtis Huttenhower, The Broad Institute, Harvard T.H. Chan School of Public Health, United States

Area Session Chair: Alex Bateman

Modern multi’omic screens of biological samples readily produce enormous numbers of measurements, yet finding statistically significant association patterns among features within these data remains challenging, in part due to the loss of statistical power inherent in testing large numbers of hypotheses. Here, we present and validate a novel hierarchical framework, HAllA (Hierarchical All-against-All association testing), for general purpose and well-powered association discovery in high-dimensional heterogeneous datasets. HAllA combines hierarchical nonparametric hypothesis testing with false discovery rate correction to enable high-sensitivity discovery of linear and non-linear associations in high-dimensional datasets (which may be categorical, continuous, or mixed). HAllA operates by 1) discretizing data to a unified representation, 2) hierarchically clustering paired high-dimensional datasets, 3) applying dimensionality reduction to boost power and potentially improve signal-to-noise ratio, and 4) iteratively testing associations between blocks of progressively more related features. We validated and optimized HAllA using synthetic datasets of known correlation structure. At a fixed false discovery rate, HAllA is consistently better-powered than naive all-against-all association testing across a range of association types. As an example application, we used HAllA to identify associations between high-throughput profiles of microbial genera and metabolites of the human gut microbiome. In addition to recapitulating known associations, we identified 60 previously unobserved associations, including between Ruminococcus and Lithocholic acid. Our implementation of HAllA is highly modular, enabling addition or substitution of alternative methods at each step, and is available with documentation at http://huttenhower.sph.harvard.edu/halla.
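
The false discovery rate correction mentioned above can be sketched with the Benjamini-Hochberg procedure. The abstract does not say which FDR method HAllA uses, so treating it as Benjamini-Hochberg is an assumption, and the function name is illustrative rather than part of the HAllA package.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values).

    Assumption: BH is used here only as a common example of FDR
    correction; it is not confirmed to be HAllA's exact procedure.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by m / k.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q
```

Testing fewer, block-level hypotheses shrinks `m`, which is one way a hierarchical scheme can retain more power than naive all-against-all testing at the same FDR threshold.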

TP007: Lightweight transcriptomics
Date: Sunday, July 10 10:50 am - 11:10 am
Room: Northern Hemisphere A1/A2
Topic: GENES
  • Surojit Biswas, Harvard University, United States
  • Konstantin Kerner, Sainsbury Laboratory Cambridge University, Germany
  • Sandra Cortijo, Sainsbury Laboratory Cambridge University, United Kingdom
  • Varodom Charoensawan, Sainsbury Laboratory Cambridge University, United Kingdom
  • Vladimir Jojic, UNC-Chapel Hill, United States
  • Philip Wigge, Sainsbury Laboratory Cambridge University, United Kingdom

Area Session Chair: Ioannis Xenarios

Transcript levels are a critical determinant of the proteome and hence cellular function. Because the transcriptome is an outcome of the interactions between genes and their products, we reasoned it may be accurately represented by a subset of transcript abundances. By analyzing thousands of publicly available RNA-Seq datasets, we show that the transcriptomes of A. thaliana and M. musculus are highly compressible. Capitalizing on this observation, we develop a method, Tradict, to reconstruct the expression of globally representative biological processes or the entire transcriptome with the abundances of a small, machine-learned subset of 100 transcripts. These findings suggest natural improvements to both the time and cost of performing forward genetic and small molecule drug screens, mapping eQTLs in natural populations, identifying tumor subtypes, and accurately profiling individual single-cell transcriptomes at scale.
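
The idea of reconstructing a full transcriptome from a small marker subset can be illustrated with ordinary least squares. This is a deliberately simple stand-in, not Tradict's model: the function names are hypothetical, and a linear map from marker abundances to all transcripts is an assumption made only for the sketch.

```python
import numpy as np

def fit_marker_model(markers, full):
    """Fit a linear map from marker abundances to the full profile.

    markers: (n_samples, n_markers) training abundances.
    full:    (n_samples, n_transcripts) training abundances.
    Returns a coefficient matrix that includes an intercept row.
    """
    design = np.column_stack([np.ones(len(markers)), markers])
    coef, *_ = np.linalg.lstsq(design, full, rcond=None)
    return coef

def reconstruct(markers, coef):
    """Predict the full profile for new samples from their markers."""
    design = np.column_stack([np.ones(len(markers)), markers])
    return design @ coef
```

In this sketch the marker set is given; choosing which transcripts to measure is the machine-learned part that the abstract attributes to Tradict.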

TP013: DEVELOPMENT OF A BAYESIAN TENSOR FACTORIZATION MODEL TO PREDICT DRUG RESPONSE CURVES IN CANCER CELL LINES
Date: Sunday, July 10 12:00 pm - 12:20 pm
Room: Northern Hemisphere A1/A2
Topic: DISEASE / DATA
  • Nathan Lazar, Oregon Health & Science University, United States
  • Mehmet Gonen, Koç University, Turkey
  • Shannon McWeeney, Oregon Health & Science University, United States
  • Adam Margolin, Oregon Health & Science University, United States
  • Kemal Sonmez, Oregon Health & Science University, United States

Area Session Chair: Ioannis Xenarios

Biological data is inherently multi-dimensional in nature, yet most computational methods used today are based to some extent on flattening these data into two-dimensional matrices. We present a new model, BaTFLED (Bayesian Tensor Factorization Linked to External Data), that predicts values in a three-dimensional response tensor using input features for each of the dimensions. We apply this to predict full dose response curves in a panel of 599 cancer cell lines treated with 545 compounds as part of the Cancer Target Discovery and Development (CTD2) effort. BaTFLED learns projection matrices mapping features for cell lines and drugs into latent representations that combine to form the responses. Predictions for new cell lines, drugs or combinations of the two can be made by multiplying through these projection matrices. A Bayesian framework allows us to place distributions on the unknown variables, which encourage sparsity both row-wise in the projection matrices (for feature selection) and in the core tensor which combines the latent vectors (selecting interactions between latent representations). We train the model using a highly efficient variational method that learns optimal parameters for a distribution approximating the true posterior. This talk will explore implications of model design choices, demonstrate initial results on the CTD2 data and discuss how these methods may be applied to other multi-dimensional datasets.
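
The prediction step described above, projecting features into latent space and combining them through a core tensor, can be sketched with point estimates. The function and argument names are hypothetical, the third mode is assumed to be dose, and the Bayesian priors and variational training are omitted entirely.

```python
import numpy as np

def predict_responses(X_cell, X_drug, X_dose, P_cell, P_drug, P_dose, core):
    """Predict a (cells x drugs x doses) response tensor.

    X_*: feature matrices, one row per cell line, drug, or dose.
    P_*: projection matrices mapping features to latent factors.
    core: tensor of shape (r_cell, r_drug, r_dose) that weights the
        interactions between latent factors.
    Assumption: point estimates replace the posterior distributions
    that a Bayesian model would maintain.
    """
    H_cell = X_cell @ P_cell   # latent representation of each cell line
    H_drug = X_drug @ P_drug   # latent representation of each drug
    H_dose = X_dose @ P_dose   # latent representation of each dose
    return np.einsum("ia,jb,kc,abc->ijk", H_cell, H_drug, H_dose, core)
```

A new cell line only needs a feature row in `X_cell`; no retraining is required to obtain its predicted responses, which matches the abstract's remark about multiplying through the projection matrices.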

TP017: Good news: we are getting better at predicting protein function
Date: Sunday, July 10 12:20 pm - 12:40 pm
Room: Northern Hemisphere A3/A4
Topic: PROTEINS / DATA
  • Predrag Radivojac, Indiana University, United States
  • Yuxiang Jiang, Indiana University, United States
  • Sean Mooney, University of Washington, United States
  • Tal Ronen-Oron, The Buck Institute for Aging Research, United States
  • Casey Greene, University of Pennsylvania, United States
  • Iddo Friedberg, Iowa State University, United States

Area Session Chair: Lenore Cowen

Background: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. At the same time, the limitations in technology for generating data and the inherently stochastic nature of biomolecular events have led to the discrepancy between the volume of data and the amount of knowledge gleaned from it. A major bottleneck in our ability to understand the molecular underpinnings of life is the assignment of function to biological macromolecules, especially proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, accurately assessing methods for protein function prediction and tracking progress in the field remain challenging.

Methodology: We have conducted the second Critical Assessment of Functional Annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. One hundred twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3,681 proteins from 18 species. CAFA2 featured significantly expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis also compared the best methods participating in CAFA1 to those of CAFA2.

Conclusions: The top performing methods in CAFA2 outperformed the best methods from CAFA1, demonstrating that computational function prediction is improving. This increased accuracy can be attributed to the combined effect of the growing number of experimental annotations and improved methods for function prediction.

TP020: INTEGRATIVE COMPUTATIONAL MODELING ACROSS TUMORS REVEALS CONTEXT SPECIFIC IMPACT OF MUTATIONS
Date: Sunday, July 10 2:00 pm - 2:20 pm
Room: Northern Hemisphere A3/A4
Topic: DISEASE / GENES
  • Hatice Osmanbeyoglu, Memorial Sloan Kettering Cancer Center, United States
  • Eneda Toska, Memorial Sloan Kettering Cancer Center, United States
  • Jose Baselga, Memorial Sloan Kettering Cancer Center, United States
  • Christina Leslie, Memorial Sloan Kettering Cancer Center, United States

Area Session Chair: Paul Horton

Pan-cancer analyses of somatic mutations and copy number aberrations have confirmed that the same genes or pathways are often altered across multiple tumor types. There is great interest in deploying targeted therapies in a pan-cancer manner, matching pathway-targeted drugs to the mutational profile of the tumor regardless of cancer type. However, ‘actionable mutations’ interact with distinct cancer-specific gene regulatory programs and signaling networks and can occur against different genetic backgrounds across tumor types. To better model the context-dependent role of somatic alterations, we applied a novel computational strategy for integrating parallel phosphoproteomic and mRNA sequencing data across 12 The Cancer Genome Atlas (TCGA) tumor data sets, linking dysregulation of upstream signaling pathways with altered transcriptional response. We then developed a statistical approach to interpret the impact of mutations and copy number events in terms of functional outcomes such as altered signaling and transcription factor (TF) activity. Our analysis revealed both known and novel transcriptional regulators downstream of oncogenic pathways. These results have implications for the prospective experimental investigation of targeted therapies in tumors harboring specific mutations. Our evolving understanding of the context-dependent role of somatic alterations may potentially enhance current approaches for combinatorial clinical trial design.

TP021: Boosting alignment accuracy through adaptive local realignment
Date: Sunday, July 10 2:00 pm - 2:20 pm
Room: Northern Hemisphere E1/E2
Topic: PROTEINS
  • Dan Deblasio, University of Arizona, United States
  • John Kececioglu, University of Arizona, United States

Area Session Chair: Jianlin Cheng

Mutation rates can vary across the residues of a protein, but when multiple sequence alignments are computed for protein sequences, the same choice of values for the substitution score and gap penalty parameters is often used across their entire length. We provide for the first time a new method called adaptive local realignment that automatically uses diverse alignment parameter settings in different regions of the input sequences when computing protein multiple sequence alignments. This allows parameter settings to locally adapt across a protein to more closely match varying mutation rates.

Our method builds on our prior work on global alignment parameter advising with the Facet alignment accuracy estimator. Given a computed alignment, in each region that has low estimated accuracy, a collection of candidate realignments is generated using a precomputed set of alternate parameter choices. If one of these alternate realignments has higher estimated accuracy than the original subalignment, the region is replaced with the realignment, and the concatenation of these realigned regions forms the new output alignment.
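
The region-by-region replacement described above can be sketched as follows. The accuracy estimator and realigner are passed in as callbacks, so nothing here reflects the actual Facet or Opal interfaces; the function name and the `threshold` parameter for what counts as "low estimated accuracy" are assumptions made for illustration.

```python
def adaptive_local_realign(regions, estimate_accuracy, realign,
                           parameter_choices, threshold):
    """Replace low-accuracy regions with their best candidate realignment.

    regions: list of sub-alignments, in order along the alignment.
    estimate_accuracy: callable(sub_alignment) -> float (hypothetical
        stand-in for an accuracy estimator).
    realign: callable(sub_alignment, params) -> sub_alignment
        (hypothetical stand-in for an aligner).
    parameter_choices: precomputed alternate parameter settings.
    threshold: estimated accuracy below which a region is revisited.
    """
    output = []
    for region in regions:
        best, best_score = region, estimate_accuracy(region)
        if best_score < threshold:
            for params in parameter_choices:
                candidate = realign(region, params)
                score = estimate_accuracy(candidate)
                # Keep a candidate only if it beats the current best.
                if score > best_score:
                    best, best_score = candidate, score
        output.append(best)
    # The concatenation of these regions forms the new alignment.
    return output
```

Regions whose estimate is already above the threshold are never realigned, so the extra cost scales with the number of poorly aligned regions rather than with alignment length.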

Adaptive local realignment significantly improves the quality of alignments over using the single best default parameter choice. In particular, this new method of local advising, when combined with prior methods for global advising, boosts alignment accuracy by almost 23% over the best default parameter setting on the hardest-to-align benchmarks (and almost 5.9% over using global advising alone).

A new version of the Opal multiple sequence aligner that incorporates adaptive local realignment, using Facet for parameter advising, is available free for non-commercial use at facet.cs.arizona.edu.

TP024: The Post-Genomic Era of Biological Network Alignment: Latest Insights
Date: Sunday, July 10 2:20 pm - 2:40 pm
Room: Northern Hemisphere E1/E2
Topic: SYSTEMS
  • Lei Meng, University of Notre Dame, United States
  • Vipin Vijayan, University of Notre Dame, United States
  • Tijana Milenkovic, University of Notre Dame, United States

Area Session Chair: Jianlin Cheng

Analogous to genomic sequence alignment, biological network alignment (NA) aims to find regions of similarities between molecular networks of different species. NA can be divided into local (LNA) or global (GNA). LNA finds small, highly conserved network regions; GNA finds large, suboptimally conserved regions. When a new NA method is proposed, it is compared against existing methods from the same NA category. However, both LNA and GNA aim to allow for transferring functional knowledge from well- to poorly-studied species between conserved (aligned) network regions. So, which one to choose, LNA or GNA? To answer this, we introduce the first systematic evaluation of the two NA categories and new measures of alignment quality that allow for fair comparison of the different LNA and GNA outputs. We find that LNA and GNA give complementary results: LNA has high functional but low topological quality, while GNA has the opposite. Thus, we propose IGLOO, a new approach that integrates GNA and LNA. IGLOO allows for a trade-off between topological and functional alignment quality better than any existing LNA and GNA methods. NA can also be divided into pairwise NA of two networks (PNA) vs. multiple NA of more than two networks (MNA). MNA may be more useful since it can capture at once biological knowledge common to multiple species. We present multiMAGNA++, a novel and superior MNA approach, and we introduce new MNA quality measures to allow for more complete alignment characterization and more fair MNA method evaluation compared to the existing measures.

TP025: Efficient Data-Driven Model Learning for Dynamical Systems
Date: Sunday, July 10 2:40 pm - 3:00 pm
Room: Northern Hemisphere A1/A2
Topic: SYSTEMS / DATA
  • Ermao Cai, Carnegie Mellon University, United States
  • Ifigeneia Apostolopoulou, Carnegie Mellon University, United States
  • Pranay Ranjan, Carnegie Mellon University, United States
  • Paul Pan, Carnegie Mellon University, United States
  • Mark Wuebbens, Carnegie Mellon University, United States
  • Diana Marculescu, Carnegie Mellon University, United States

Area Session Chair: Hagit Shatkay

In the analysis of non-linear dynamical biological systems, it is often of interest to determine an efficient, qualitative estimate of the behavior of the state variables as opposed to exact, quantitative measures which may be intractable or too expensive to obtain. Moreover, established closed form mathematical rules governing system behavior are not always available and one may need to emulate the nature of the system on the basis of observations and experimental data only. In this paper, we propose to rely on Boolean models for analyzing dynamical systems and develop a polynomial time complexity heuristic algorithm to infer such Boolean functions for dynamical systems with refractory periods. Our algorithm is structured to perform even more efficiently for systems with a nested canalizing behavior with respect to certain features, which is indeed the case for life science applications. For data obtained from existing dynamical systems, e.g., T helper (Th) cell signaling network, T-LGL survival network, and T-cell differentiation, our algorithm is 100X faster than two other state-of-the-art methods, yet achieves similar or better accuracy.
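
For readers unfamiliar with the term, a nested canalizing Boolean function checks its inputs in a fixed order, and the first input that takes its canalizing value decides the output. The sketch below only evaluates such a function; it is not the inference algorithm from the talk, and the function and parameter names are illustrative.

```python
def nested_canalizing(inputs, canalizing_values, canalized_outputs, default):
    """Evaluate a nested canalizing Boolean function.

    inputs: input values, ordered from most to least dominant.
    canalizing_values: for each input, the value that triggers it.
    canalized_outputs: the output forced when that input triggers.
    default: output when no input takes its canalizing value.
    """
    for x, a, b in zip(inputs, canalizing_values, canalized_outputs):
        if x == a:
            return b
    return default

# Example: x1 OR (x2 AND x3) written in nested canalizing form.
# x1 = 1 forces 1; otherwise x2 = 0 forces 0; otherwise x3 = 0 forces 0.
def example_rule(x1, x2, x3):
    return nested_canalizing((x1, x2, x3), (1, 0, 0), (1, 0, 0), default=1)
```

Such a function on n inputs is fully described by an ordering plus a few values per input, which is far fewer parameters than a full truth table with 2^n rows.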

TP026: intSKAT, an integrated Sequence Kernel Association Test, to identify novel clinically impactful somatic mutations in melanomas
Date: Sunday, July 10 2:40 pm - 3:00 pm
Room: Northern Hemisphere A3/A4
Topic: DISEASE / DATA
  • Yian Chen, Moffitt Cancer Center, United States
  • Zachary Thompson, Moffitt Cancer Center, United States
  • Jamie Teer, Moffitt Cancer Center, United States
  • Fernanda Flores, Moffitt Cancer Center, United States
  • Manali Phadke, Moffitt Cancer Center, United States
  • Zhihua Chen, Moffitt Cancer Center, United States
  • Eric Welsh, Moffitt Cancer Center, United States
  • Michael Schell, Moffitt Cancer Center, United States
  • Keiran Smalley, Moffitt Cancer Center, United States

Area Session Chair: Paul Horton

INTRODUCTION
In recent years, much has been learned about the molecular basis of progression, and therapeutic strategies based on mutation information have been developed for some cancer types. Taking melanoma as an example, it is known that ~50% of melanomas have BRAF mutations, and BRAF inhibitors have been developed with initial success for treatment. However, after accounting for patients with major known driver mutations, BRAF (~50%), NRAS (~15%-20%), and NF1 (~14%), there are still ~20% of melanoma patients without clearly known driver mutations responsible for the development or aggressiveness of the disease. The lack of identified important non-passenger mutations in this subgroup (or in any other cancer type) poses a significant challenge and also provides a great opportunity for developing therapeutic strategies. This becomes particularly important for developing personalized therapeutic strategies.
Traditionally, driver mutations are identified in one of the following ways: some methods flag mutations whose frequencies are higher than expected, while others determine positive selection for non-silent mutations (such as frameshift indels, nonsense and splice-site mutations) by weighting the predicted functional impact and observed frequencies. Although these methods have been shown to be useful for identifying driver mutations, they have limited power to detect infrequently mutated driver genes.
We proposed an expansive and integrated approach to link genotype to phenotypes to identify clinically relevant somatic mutations. This is accomplished by performing a flexible and powerful gene-based association test, intSKAT, to investigate the association between mutations in each gene and patients’ overall survival outcome.
METHODS
Built upon a gene-based sequence kernel association test (SKAT) [1], developed for germline studies, we developed an integrated association test, intSKAT, to identify novel somatic mutations that are associated with a clinically relevant outcome, e.g., overall survival (OS). We first coded the multi-allelic mutations into bi-allelic variants with reference versus alternative allele. Our method included an expansive suite of eight gene-based methods: 1. Burden test, 2. SKAT, 3. SKAT-O, 4-6. Burden, SKAT, and SKAT-O weighted by PolyPhen-2 score, 7. Cox regression with mutation status in a gene (0/1) as the predictor, and 8. Cox regression with the number of mutations in a gene as the predictor.
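The two gene-level predictors used by methods 7 and 8 above (mutation status and mutation count) can be derived from a list of mutation calls as in this minimal sketch. The record layout of (patient, gene) pairs and the function name are assumptions for illustration; the survival models themselves are not shown.

```python
from collections import Counter

def gene_predictors(mutations, patients, gene):
    """Build per-patient predictors for one gene.

    mutations: iterable of (patient_id, gene_symbol) pairs, one per
        called somatic mutation (hypothetical record layout).
    patients: ordered list of patient ids in the cohort.
    Returns (status, count): status is 0/1 per patient, count is the
    number of mutations in the gene per patient.
    """
    counts = Counter(p for p, g in mutations if g == gene)
    count = [counts.get(p, 0) for p in patients]
    status = [1 if c > 0 else 0 for c in count]
    return status, count
```

Either vector could then be supplied as the single covariate of a Cox proportional hazards model fitted per gene.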
This method can not only evaluate joint effects of mutations within a gene and identify important genes with infrequent mutations, but also has the flexibility to leverage functional predictions when available. It also allows combinations of mutations with different directions (protective or deleterious) and different levels of functional prediction (unknown or predicted) to be ranked high. FDR correction is performed to adjust for multiple comparisons within each method, and a minFDR of 10^-3 across all methods is used to declare statistical significance. Furthermore, we performed robust regression to regress the number of mutations within each gene against the length of the longest transcript. Genes that were significantly associated with OS and also had higher than expected standardized residuals were considered more likely to be non-passenger genes.
Using the targeted exome sequencing data in 185 melanoma patients from the Total Cancer Care (TCC) database at Moffitt Cancer Center, we applied intSKAT to investigate the association between mutations in genes and patients’ OS as a proof of principle study. Briefly, regarding the sequencing and variant calls, tumor samples from the TCC project were subjected to genomic capture (performed by BGI, Shenzhen using SureSelect custom designs targeting 1,321 genes, Agilent Technologies, Inc., Santa Clara, CA) and massively parallel sequencing. Sequences were aligned to the hs37d5 human reference with the Burrows-Wheeler Aligner (BWA). Insertion/deletion realignment, quality score recalibration, and variant identification were performed with the Genome Analysis ToolKit (GATK). Sequence variants were annotated with ANNOVAR and custom scripts. We limited variants to those within the 1,321 gene target regions plus 100 flanking base pairs. High quality variants were retained by including only variants with GQ score >=15 and excluding variants in the least specific VQSR Tranche (100.00). Variants were further retained if >=80% of the samples had a high quality genotype call (reference or variant) at that position. Somatic mutations were enriched by removing variants observed >1% in 1000 Genomes, ESP African or ESP European populations. Variants were also removed if observed >1% in a set of 238 normal tissue samples subjected to the same capture and sequencing procedure. Variants were finally filtered to include only protein altering (nonsynonymous, frameshifting or non-frameshifting indels, stopgain, stoploss, and splicing variants) or only protein altering plus UTR.

In addition to performing intSKAT, we performed robust regression to further narrow down the non-passenger mutations, which can drive the disease aggressiveness in the discovery phase (Figure 1). To validate our approach, we used the whole exome sequencing data and overall survival information from TCGA (N=211) (Figure 1). For validation studies, variants were limited to the 1,321 gene target regions plus 100 flanking base pairs. We decided to use real-world sequencing data and patients’ survival data to reflect the real-world complexity.

Finally, after identifying our top gene with mutations, we performed cell line experiments to elucidate the potential roles of the mutations in the gene(s). The melanoma cell lines Malme-3M and MeWo were purchased from ATCC. Malme-3M and MeWo cells were cultured in RPMI complete medium with 20% and 10% FBS, respectively. Cells were grown at 37°C in a 5% CO2 humidified atmosphere.

Three-dimensional spheroid assay
The three-dimensional melanoma spheroids were prepared using the liquid overlay method. Melanoma cells were added to a 96-well plate coated with agar for 72h. Spheroids were harvested, implanted into collagen I, and left to grow for 72h. Then, spheroids were washed in PBS and treated with Calcein-AM and propidium iodide for 1h at 37°C. Afterwards, pictures were taken using a Nikon-300 inverted fluorescence microscope. The percentage of invasion was determined using ImageJ software. siEPHA7 knockdown experiments were performed to investigate its effect on invasion.

Inverse Matrigel invasion assay
The Matrigel invasion assay was performed. Matrigel was prepared 1:1 in ice cold PBS and inserted in 8 micron pore 6.5 mm diameter uncoated Transwells into the wells of a 24 well tissue culture plate and incubated for 30 min at 37°C. Cell suspensions (1 x 10^5/ml) were added onto the upward facing underside of the filter and incubated in the inverted state for 4 hours. Each transwell was washed in serum free medium, and 100 μl of RPMI with 10% FBS was added into the transwell and incubated for 72h at 37°C. The cells were fixed in 1 ml of 4% para-formaldehyde/0.2% Triton-X 100 and stained with 1 ml of 4 μM Calcein AM solution for 1h at room temperature. The images were obtained by confocal microscopy. siEPHA7 knockdown experiments were performed to investigate its effect on invasion.

RESULTS & DISCUSSION
A total of 22,848 variants were identified in 1,345 genes, 24 of which were genes near the 1,321 targeted genes. In the discovery phase, 12 genes had minFDR < 10^-3. Among these, 6 genes had standardized residuals greater than 2: ADAMTS18, DNAH8, EPHA7, LRP1B, MUC16, and TTN (p<0.008). We are in the process of downloading and processing the TCGA data for a formal validation analysis. As a quick validation, we looked up the association between mutations and OS using cBioPortal. Of the 6 genes, 3 validated (p < 0.05) in this initial lookup through cBioPortal, including EPHA7 (P Burden = 1.47 x 10^-7 for TCC; p log-rank test = 0.03 for TCGA) and MUC16 (P Burden = 2.23 x 10^-6 for TCC; p log-rank test = 0.015 for TCGA). TTN (P Burden = 1.22 x 10^-6 for TCC) showed a similar trend in TCGA, but with p = 0.07 using the log-rank test. The melanoma cell lines Malme-3M and MeWo contain some mutations. Both siEPHA7 knockdown experiments, using 3-D spheroid assays and inverse Matrigel invasion assays, showed that EPHA7 knockdown significantly reduced cell invasion, by 40% (p<0.01) and by 53.4% (p<0.001), respectively. The striking impact of EPHA7 knockdown on both cell survival and cell invasion showed that EPHA7 likely plays a major role as a regulator of metastasis.
Identifying EPHA7 as a gene with important non-passenger mutations demonstrated the power of evaluating the association between infrequent mutations jointly in a gene with patients’ clinical outcomes.

CONCLUSIONS
Through our three-phase melanoma study, we have demonstrated that our proposed integrated approach, combining intSKAT and robust regression, can successfully identify novel clinically impactful genes with mutations. Our proposed method should be readily applicable to discover novel mutations for other cancer types and provide potentially important strategies for personalized treatment options.



FIGURE 1. A three-phase study design to test our proposed method intSKAT, an integrated approach to discover novel clinically impactful mutations in melanoma patients.

REFERENCES

1. Wu, M.C., et al., Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet, 2011. 89(1): p. 82-93.

TP029: A Weighted Exact Test for Significance of Mutually Exclusive Mutations in Cancer
Date:Sunday, July 10 3:30 pm - 3:50 pm
Room: Northern Hemisphere A3/A4
Topic: DISEASE / GENES
  • Mark Leiserson, Brown University, United States
  • Matthew Reyna, Brown University, United States
  • Benjamin Raphael, Brown University, United States

Area Session Chair: Paul Horton

Presentation Overview:

Large-scale cancer sequencing efforts over the past decade from consortia such as The Cancer Genome Atlas have revealed that different combinations of mutations cause cancer in different patients. One method for distinguishing the driver mutations responsible for cancer from the random mutations with no role in cancer is to search for combinations of mutations that are mutually exclusive across tumors. We introduce a new statistical test for mutual exclusivity that uses the observed number of mutations in genes and tumor samples. The statistical test weights mutations with per gene, per sample mutation probabilities. We present a formula for computing this test exactly, and derive an approximation that can compute the tail probability quickly and accurately. We demonstrate our approach by applying it to hundreds of colorectal, thyroid, and endometrial cancers.
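
To make the idea of weighting by per-gene, per-sample mutation probabilities more concrete, the sketch below computes an exact tail probability for a single pair of genes. It assumes that mutations are independent Bernoulli events with known probabilities, which is a simplification and not the formula presented by the authors; the function name and the example probabilities are hypothetical.

```python
def exclusivity_tail(p_a, p_b, observed):
    """P(T >= observed), where T counts samples mutated in exactly one of two genes.

    p_a[j], p_b[j]: assumed probability that gene A / gene B is mutated in sample j.
    """
    # Per-sample probability of an exclusive mutation (exactly one gene mutated).
    q = [a * (1 - b) + b * (1 - a) for a, b in zip(p_a, p_b)]
    # Dynamic programme over samples: dist[t] = P(t exclusive samples so far).
    dist = [1.0]
    for qj in q:
        nxt = [0.0] * (len(dist) + 1)
        for t, pr in enumerate(dist):
            nxt[t] += pr * (1 - qj)      # sample j is not exclusive
            nxt[t + 1] += pr * qj        # sample j is exclusive
        dist = nxt
    return sum(dist[observed:])


# Hypothetical probabilities for four tumor samples.
p_value = exclusivity_tail([0.1, 0.4, 0.2, 0.3], [0.3, 0.1, 0.2, 0.1], observed=3)
```

Under this simplified model, a small tail probability means that more samples carry an exclusive mutation than the per-sample probabilities alone would predict.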

TP032: Clonal evolution inference and visualization in metastatic colorectal cancer
Date:Sunday, July 10 3:50 pm - 4:10 pm
Room: Northern Hemisphere A3/A4
Topic: DISEASE / GENES
  • Ha X. Dang, Washington University in St. Louis, United States
  • Julie Grossman, Washington University in St. Louis, United States
  • Brian White, Washington University in St. Louis, United States
  • Steven Foltz, Washington University in St. Louis, United States
  • Christopher Miller, Washington University in St. Louis, United States
  • Jingqin Luo, Washington University in St. Louis, United States
  • Timothy Ley, Washington University in St. Louis, United States
  • Richard Wilson, Washington University in St. Louis, United States
  • Elaine Mardis, Washington University in St. Louis, United States
  • Ryan Fields, Washington University in St. Louis, United States
  • Christopher Maher, Washington University in St. Louis, United States

Area Session Chair: Paul Horton

Presentation Overview:


Dissecting genomic heterogeneity and clonal evolution in tumors is critical to understanding cancer progression, metastasis, and recurrence. To identify subclonal populations of cancer cells, somatic variants identified via sequencing are often clustered across tumor samples based on their variant allele frequencies (VAF) or cancer cell fractions (CCF). We developed ClonEvol, a tool to infer and visualize clonal evolution models in multiple related tumor samples using pre-clustered variants. We demonstrated that ClonEvol was able to infer clonal evolution models using published and simulated datasets. We also used ClonEvol to infer clonal evolution models for an unpublished dataset of whole genome/exome and targeted sequencing of multi-organ, multi-region primary and metastatic tumors from a metastatic colorectal cancer cohort. We discovered that metastasis seeding in colorectal cancers was a complex event that involved multiple subclones from primary and metastatic tumors. Moreover, the critical subclones that drove metastasis were often missed when a single biopsy was sequenced from the primary tumors, thus necessitating multi-region sequencing for monitoring clonal evolution and identifying critical events driving metastasis. ClonEvol is available at https://github.com/hdng/clonevol
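
A constraint commonly used when ordering pre-clustered variants into a clonal tree is that, in every sample, a parent clone's cellular fraction cannot be smaller than the combined fraction of its direct subclones. The sketch below checks that constraint for one candidate tree. It is a generic illustration with hypothetical names and made-up fractions, not ClonEvol's interface or its actual algorithm.

```python
def violates_sum_rule(parent_of, ccf, tolerance=0.0):
    """Return (parent, sample_index) pairs where children's CCFs exceed the parent's.

    parent_of: dict mapping each clone to its parent (None for the founding clone).
    ccf: dict mapping each clone to a list of cellular fractions, one per sample.
    """
    children = {}
    for clone, parent in parent_of.items():
        if parent is not None:
            children.setdefault(parent, []).append(clone)
    violations = []
    for parent, kids in children.items():
        for s in range(len(ccf[parent])):
            total = sum(ccf[k][s] for k in kids)
            if total > ccf[parent][s] + tolerance:
                violations.append((parent, s))
    return violations


# Hypothetical tree: founding clone A with two subclones, measured in two samples.
tree = {"A": None, "B": "A", "C": "A"}
fractions = {"A": [1.0, 1.0], "B": [0.6, 0.3], "C": [0.3, 0.8]}
problems = violates_sum_rule(tree, fractions)
```

A candidate tree with no violations in any sample remains consistent with the data; in practice a nonzero `tolerance` would be needed to absorb measurement noise.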

TP036: Investigating molecular determinants of ebolavirus pathogenicity
Date:Sunday, July 10 4:10 pm - 4:30 pm
Room: Northern Hemisphere E1/E2
Topic: DISEASE / PROTEINS
  • Morena Pappalardo, University of Kent, United Kingdom
  • Miguel Juliá, University of Kent, United Kingdom
  • Mark Howard, University of Kent, United Kingdom
  • Jeremy Rossman, University of Kent, United Kingdom
  • Martin Michaelis, University of Kent, United Kingdom
  • Mark Wass, University of Kent, United Kingdom

Area Session Chair: Jianlin Cheng

Presentation Overview:

The West Africa Ebola virus outbreak has killed thousands of people and demonstrated the scale on which the virus threatens human life. Using extensive sequencing data obtained during the outbreak, we compare Ebolavirus genomes to identify potential molecular determinants of Ebolavirus pathogenicity. Of the five Ebolavirus species, only Reston viruses are not pathogenic in humans. We compared the Reston virus genome with those from the four human pathogenic species to identify specificity determining positions (SDPs) that are differentially conserved and may therefore act as molecular determinants of pathogenicity. We initially identified 189 SDPs using 196 Ebolavirus genome sequences. We report a reduced number of SDPs using a much larger set of sequences from the current outbreak. Structural analysis was performed to identify SDPs that are likely to alter protein structure and function and could be associated with pathogenicity. The most striking findings were in Ebolavirus proteins VP24 and VP40. In particular, SDPs present in VP24 are likely to impair binding to human karyopherin alpha proteins and therefore prevent inhibition of interferon signaling in response to viral infection. VP24 is also critical for Ebolavirus adaptation to novel hosts, and as only a few SDPs distinguish Reston virus VP24 from VP24 of other Ebolaviruses, it is possible that human pathogenic Reston viruses may emerge.
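
The notion of a differentially conserved position can be sketched in a few lines. The example below uses the strictest possible reading, assuming a position counts only if it is fully conserved within each group and differs between groups; the function name and toy sequences are hypothetical, and the abstract does not state which scoring scheme was actually used.

```python
def specificity_determining_positions(group_a, group_b):
    """Return 0-based alignment columns conserved within each group but differing between them.

    group_a, group_b: lists of equal-length aligned sequences
    (for example Reston vs. human-pathogenic species).
    """
    length = len(group_a[0])
    positions = []
    for i in range(length):
        residues_a = {seq[i] for seq in group_a}
        residues_b = {seq[i] for seq in group_b}
        # Fully conserved in both groups, with a different residue in each.
        if len(residues_a) == 1 and len(residues_b) == 1 and residues_a != residues_b:
            # Columns containing alignment gaps are skipped.
            if "-" not in residues_a | residues_b:
                positions.append(i)
    return positions


# Toy alignment: column 1 separates the groups, column 3 varies within group B.
sdps = specificity_determining_positions(["MKTA", "MKTA"], ["MRTA", "MRTS"])
```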

TP037: LINEs between species: Evolutionary dynamics of LINE-1 retrotransposons across the eukaryotic tree of life
Date:Monday, July 11 10:10 am - 10:30 am
Room: Northern Hemisphere A1/A2
Topic: GENES
  • Atma Ivancevic, The University of Adelaide, Australia
  • Dan Kortschak, The University of Adelaide, Australia
  • Terry Bertozzi, South Australian Museum, Australia
  • David Adelson, University of Adelaide, Australia

Area Session Chair: Yana Bromberg

Presentation Overview:

LINE-1 (L1) retrotransposons are dynamic elements. They have the potential to cause great genomic change by inserting copies of themselves throughout the genome, resulting in the duplication and rearrangement of regulatory DNA. Active L1, in particular, are often thought of as tightly constrained, homologous and ubiquitous elements with well-characterised domain organisation. For the past 30 years, model organisms have been used to define L1s as 6-8kb sequences containing a 5’-UTR, two open reading frames working harmoniously in cis, and a 3’-UTR with a polyA tail.
In this study, we demonstrate the remarkable and overlooked diversity of L1s via a comprehensive phylogenetic analysis of over 500 species from widely divergent branches of the tree of life. The rapid and recent growth of L1 elements in mammalian species is juxtaposed against their decline in plant species and complete extinction in most reptiles and insects. In fact, some of these previously unexplored mammalian species (e.g. snub-nosed monkey, minke whale) exhibit L1 retrotranspositional ‘hyperactivity’ far surpassing that of human or mouse. In contrast, non-mammalian L1s have become so varied that the current classification system seems to inadequately capture their structural characteristics. Our findings illustrate how both long-term inherited evolutionary patterns and random bursts of activity in individual species can significantly alter genomes, highlighting the importance of L1 dynamics in eukaryotes.

TP045: A Framework for Integrating Co-expression Networks with GWAS to Prioritize Candidate Genes in Maize
Date:Monday, July 11 10:50 am - 11:10 am
Room: Northern Hemisphere E1/E2
Topic: SYSTEMS / GENES
  • Robert Schaefer, University of Minnesota, United States
  • Jean-Michel Michno, University of Minnesota, United States
  • Joseph Jeffers, University of Minnesota, United States
  • Owen Hoekenga, Independent Consultant, United States
  • Brian Dilkes, Purdue University, United States
  • Ivan Baxter, Donald Danforth Plant Science Center/USDA-ARS Plant Genetics Research Unit, United States
  • Chad Myers, University of Minnesota, United States

Area Session Chair: Nicola Mulder

Presentation Overview:

Genome wide association studies (GWAS) have identified thousands of loci linked to hundreds of traits in many different species. However, in many cases, the causal genes and the cellular processes they contribute to remain unknown. This problem is even more pronounced in non-model species where functional annotations are sparse. To address these issues, we developed a computational framework called Camoco (Co-Analysis of Molecular Components) that systematically integrates loci identified by GWAS with gene co-expression networks to identify a focused set of putative causal genes that are coordinately regulated. We demonstrate the utility of our approach on new GWAS studies in maize, the world’s most produced staple crop. Using our approach, candidate SNPs associated with elemental accumulation in maize kernels were reduced by two orders of magnitude. Our study reveals the importance of gene expression data context as only root tissue-specific co-expression networks based on gene expression signatures across genotypically diverse individuals were able to provide signal for interpreting GWAS candidate SNPs. Both the software tools we developed and the lessons on integrating GWAS data with co-expression networks generalize to other contexts.

TP050: The Role of Genome Accessibility in Transcription Factor Binding in Bacteria
Date:Monday, July 11 12:00 pm - 12:20 pm
Room: Northern Hemisphere A3/A4
Topic: GENES / PROTEINS
  • Antonio Gomes, Columbia University, United States
  • Harris Wang, Columbia University, United States

Area Session Chair: Bruno Gaeta

Presentation Overview:

ChIP-seq enables genome-scale identification of regulatory regions that govern gene expression. However, the biological insights generated from ChIP-seq analysis have been limited to predictions of binding sites and cooperative interactions. Furthermore, ChIP-seq data often poorly correlate with in vitro measurements or predicted motifs, highlighting that binding affinity alone is insufficient to explain transcription factor (TF)-binding in vivo. One possibility is that binding sites are not equally accessible across the genome. A more comprehensive biophysical representation of TF-binding is required to improve our ability to understand, predict, and alter gene expression. Here, we show that genome accessibility is a key parameter that impacts TF-binding in bacteria. We developed a thermodynamic model that parameterizes ChIP-seq coverage in terms of genome accessibility and binding affinity. The role of genome accessibility is validated using a large-scale ChIP-seq dataset of the M. tuberculosis regulatory network. We find that accounting for genome accessibility led to a model that explains 63% of the ChIP-seq profile variance, while a model based on motif score alone explains only 35% of the variance. Moreover, our framework enables de novo ChIP-seq peak prediction and is useful for inferring TF-binding peaks in new experimental conditions by reducing the need for additional experiments. We observe that the genome is more accessible in intergenic regions, and that increased accessibility is positively correlated with gene expression and anti-correlated with distance to the origin of replication. Our biophysical model provides a more comprehensive description of TF-binding in vivo from first principles towards a better representation of gene regulation in silico, with promising applications in systems biology.

TP058: Candidate gene prioritization with Endeavour
Date:Monday, July 11 2:00 pm - 2:20 pm
Room: Northern Hemisphere E1/E2
Topic: DISEASE / DATA
  • Léon-Charles Tranchevent, Laboratoire de Biologie et de Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, Université de Lyon, France
  • Amin Ardeshirdavani, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
  • Sarah Elshal, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
  • Daniel Alcaide, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
  • Jan Aerts, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
  • Didier Auboeuf, Laboratoire de Biologie et de Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, Université de Lyon, France
  • Yves Moreau, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium

Area Session Chair: Judith Blake

Presentation Overview:

Genomic studies and high-throughput experiments often produce large lists of candidate genes among which only a few are truly relevant to the disease, phenotype, or biological process of interest. Gene prioritization tackles this problem by profiling candidate genes across multiple genomic data sources and integrating this heterogeneous information into a global ranking. We describe an extended version of our gene prioritization method, Endeavour, now available for 6 species and integrating 75 data sources. Validation of our results indicates that this extended version of Endeavour efficiently prioritizes candidate genes. The Endeavour web server is freely available at https://endeavour.esat.kuleuven.be/
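
The step of integrating per-source rankings into one global ranking can be illustrated with a deliberately simple fusion rule. The sketch below averages each gene's rank ratio across sources; this is a stand-in chosen for brevity and should not be read as the fusion method Endeavour itself uses. The function name, source names and gene labels are all hypothetical.

```python
def fuse_rankings(rankings):
    """Combine per-source candidate rankings into one global ranking by mean rank ratio.

    rankings: dict mapping a data-source name to a list of genes, best first.
    A gene missing from a source is ignored for that source.
    """
    ratios = {}
    for ordered in rankings.values():
        n = len(ordered)
        for position, gene in enumerate(ordered, start=1):
            ratios.setdefault(gene, []).append(position / n)
    mean_ratio = {g: sum(r) / len(r) for g, r in ratios.items()}
    # Lower mean rank ratio is better; ties are broken alphabetically for determinism.
    return sorted(mean_ratio, key=lambda g: (mean_ratio[g], g))


per_source = {
    "expression": ["G1", "G2", "G3"],
    "literature": ["G2", "G1", "G3"],
    "pathways": ["G1", "G3", "G2"],
}
global_ranking = fuse_rankings(per_source)
```

Using rank ratios rather than raw scores keeps sources with very different score scales comparable, which is the main reason rank-based fusion is attractive for heterogeneous data.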

TP062: Furthering understanding of human diseases through integrative cross-species analysis
Date:Monday, July 11 2:20 pm - 2:40 pm
Room: Northern Hemisphere E1/E2
Topic: DISEASE / DATA
  • Victoria Yao, Princeton University, United States
  • Rachel Kaletsky, Princeton University, United States
  • Coleen Murphy, Princeton University, United States
  • Olga Troyanskaya, Princeton University, United States

Area Session Chair: Judith Blake

Presentation Overview:

The etiology of complex human diseases is challenging to study, as they likely arise from a combination of many environmental and genetic factors. Elucidating the molecular basis of the pathophysiology of such diseases requires a combination of systems-level analyses in human and experimental investigations in model organisms. To fully leverage model systems to study human disease, we propose a framework that can combine human quantitative genetics results and computational models of model organism tissue biology to drive experimental screens for disruption of disease-relevant processes and identify candidate disease genes. Specifically, we develop a novel semi-supervised regularized Bayesian integration method to integrate a large compendium of heterogeneous datasets, primarily composed of publicly available expression datasets in the model organism C. elegans. Using this method, we construct 203 tissue- and cell-type specific networks, and we demonstrate the accuracy of these networks in capturing tissue-specific functional signal, even for very small tissues and specific cell types. Combining these model organism functional maps with human quantitative genetics signal, we make disease gene predictions for 10 different diseases based on GWAS. Focusing on Parkinson's disease, we further experimentally screen 45 of the top Parkinson's disease predictions for age-related motility defects. Analysis of 13,255 worms across 1,823 videos identifies significant age-related Parkinson's endophenotypes. Genes that correspond to strong phenotypes are prime candidates for further inquiry in human and could eventually be pursued as potential therapeutic targets.

TP064: Multi-Genome Scaffold Co-Assembly Based on the Analysis of Gene Orders and Genomic Repeats
Date:Monday, July 11 2:40 pm - 3:00 pm
Room: Northern Hemisphere A1/A2
Topic: GENES
  • Sergey Aganezov, Computational Biology Institute & Department of Mathematics, The George Washington University, United States
  • Max Alekseyev, George Washington University, United States

Area Session Chair: Pedja Radivojac

Presentation Overview:

Advances in DNA sequencing technology over the past decades have increased the volume of raw sequenced genomic data available for further assembly and analysis. While many software tools exist for assembling sequenced genomic material, they often have difficulty reconstructing complete chromosomes. Major obstacles include uneven read coverage and the presence of long similar DNA subsequences (repeats). Genome assemblers are therefore often able to reliably reconstruct only long fragments, called scaffolds. We present a method for simultaneous co-assembly of all fragmented genomes (represented as collections of scaffolds rather than chromosomes) in a given set of annotated genomes. The method is based on the analysis of gene orders and relies on an evolutionary model that includes genome rearrangements as well as gene insertions and deletions. It can also utilize information about genomic repeats and the phylogenetic tree of the given genomes, further improving their assembly quality.

TP065: Most of the tight positional conservation of transcription factor binding sites near the transcription start site is due to their co-localization within regulatory modules
    Cancelled
Date:Monday, July 11 2:40 pm - 3:00 pm
Room: Northern Hemisphere A3/A4
Topic: GENES / PROTEINS
  • Natalia Acevedo-Luna, Iowa State University, United States
  • Leonardo Mariño-Ramírez, NIH, United States
  • Armand Halbert, NIH, United States
  • Ulla Hansen, Boston University, United States
  • David Landsman, NIH, United States
  • John Spouge, NIH, United States

Area Session Chair: Reinhard Schneider

Presentation Overview:

Transcription factors (TFs) form complexes that bind regulatory modules (RMs) within DNA. Consider a “Subunit Hypothesis”: sometimes, different TF complexes contain inexact copies of a subunit that coordinates the regulation of specific genes. Then, within the RMs for the genes, transcription factor binding sites should display tightly consistent positions relative to each other, and possibly, consistent positions relative to the transcription start site (TSS), too. Our statistics found 43 significant sets of TF motifs with positional preferences relative to the TSS, with 38 preferences tight (±5 bp). Each set of motifs corresponded to a “gene group” of 135 to 3304 genes, some groups independently validated with FDR < 10^-4. The Subunit Hypothesis also implies that motifs corresponding to two TFs in a subunit should co-occur more than by chance alone, “enriching” the intersection of the gene groups corresponding to the two TFs. Of the 43 significant gene groups, we found 779 pairs of gene groups with significantly enriched intersections, many independently validated. A user-friendly web site at http://go.usa.gov/3kjsH permits experimental biologists to explore the interaction network of our TFBSs to identify candidate subunit RMs. Gene duplication and convergent evolution within a genome provide obvious biological mechanisms for replicating an RM that binds a particular TF subunit.

TP071: Solving the influence maximization problem on biological networks; a case study involving the cell cycle regulatory network in Saccharomyces cerevisiae
Date:Monday, July 11 3:50 pm - 4:10 pm
Room: Northern Hemisphere BCD
Topic: DATA / SYSTEMS
  • David Gibbs, Institute for Systems Biology, United States
  • Ilya Shmulevich, Institute for Systems Biology, United States

Area Session Chair: Russell Schwartz

Presentation Overview:

The Influence Maximization Problem (IMP) aims to discover the set of nodes with the greatest influence on network dynamics. The problem has previously been applied in epidemiology and social network analysis. Here, we demonstrate the application to cell cycle regulatory network analysis of Saccharomyces cerevisiae.
Fundamentally, gene regulation is linked to the flow of information. Therefore, our implementation of the IMP was framed as an information theoretic problem on a diffusion network. Utilizing all regulatory edges from YeastMine, gene expression dynamics were encoded as edge weights using a variant of time lagged transfer entropy, a method for quantifying information transfer across variables. Influence, for a particular number of sources, was measured using a diffusion model based on Markov chains with absorbing states. By maximizing over different numbers of sources, an influence ranking on genes was produced.
The influence ranking was compared to other metrics of network centrality. Although ‘top genes’ from each centrality ranking contained well known cell cycle regulators, there was little agreement and no clear winner. However, it was found that influential genes tend to directly regulate or sit upstream of genes ranked by other centrality measures. This is quantified by computing node reachability between gene sets; on average, 59% of central genes can be reached when starting from the influential set, compared to 7% of influential genes when starting at another centrality metric.
Influential nodes are critical sources of information flow, potentially impacting the state of the network and leading to disease.
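
The node reachability comparison between gene sets described above can be sketched as a breadth-first search over directed regulatory edges. The function name and the toy network below are hypothetical, and the sketch assumes that a node counts as reachable from itself.

```python
from collections import deque


def fraction_reachable(edges, sources, targets):
    """Fraction of `targets` reachable from any node in `sources` along directed edges."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    seen = set(sources)          # sources are reachable from themselves
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    targets = set(targets)
    return len(targets & seen) / len(targets)


# Toy regulatory network: a -> b -> c and d -> c.
toy_edges = [("a", "b"), ("b", "c"), ("d", "c")]
downstream = fraction_reachable(toy_edges, ["a"], ["b", "c", "d"])
upstream = fraction_reachable(toy_edges, ["c"], ["a", "b"])
```

Computing the fraction in both directions, as in the last two lines, mirrors the asymmetric comparison reported in the abstract between the influential set and sets ranked by other centrality measures.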

TP073: HUMAN PROTEIN COMPLEX MAP: INTEGRATION OF 10K MASS SPECTROMETRY EXPERIMENTS
Date:Monday, July 11 3:50 pm - 4:10 pm
Room: Northern Hemisphere A3/A4
Topic: PROTEINS
  • Kevin Drew, University of Texas at Austin, United States
  • Edward Marcotte, University of Texas at Austin, United States

Area Session Chair: Reinhard Schneider

Presentation Overview:

Protein complexes carry out essential functions in the cell, but we currently lack knowledge of their composition, formation and function. Several recent studies using high-throughput discovery of protein interactions have allowed the construction of protein complex maps, but the protein overlap of these maps is limited. Here we take an integrated approach by combining protein interaction experiments from multiple published mass spectrometry datasets and construct a more complete human protein complex map. We evaluate both pairwise interactions and complexes using a novel clique-based comparison method and show improved performance over the published complex maps. Additionally, we find several new complexes, including ones with enrichment for developmental disorders, suggesting candidate disease genes. The expansiveness and accuracy of this complex map yield greater understanding of cellular function and provide avenues for better disease characterization.

TP075: Scalable Tools for Quantitative Analysis of Chemical-Genetic Interactions from Sequencing-Based Chemical-Genetic Interaction Screens
Date:Monday, July 11 4:10 pm - 4:30 pm
Room: Northern Hemisphere BCD
Topic: DATA / SYSTEMS
  • Scott Simpkins, University of Minnesota, United States
  • Justin Nelson, University of Minnesota, United States
  • Raamesh Desphande, University of Minnesota, United States
  • Jeffrey Piotrowski, Yumanity Therapeutics, United States
  • Sheena Li, RIKEN Institute for Sustainable Resource Science, Japan
  • Charles Boone, University of Toronto, Canada
  • Chad Myers, University of Minnesota, United States

Area Session Chair: Russell Schwartz

Presentation Overview:

Recent improvements in the throughput of chemical-genetic interaction screens have necessitated the development of new, scalable pipelines for processing raw sequencing data from these experiments and interpreting the resulting chemical-genetic interaction profiles. We developed two computational tools, BEAN-counter and CG-TARGET, to respectively process and interpret the large influx of data from high-throughput chemical-genomic screens. These pipelines were applied to chemical-genetic interaction screens of more than 18,000 compounds in S. cerevisiae, ultimately yielding more than 2,000 compounds with high confidence predictions to biological process targets. We confirmed that our process-level target predictions overlap with the known functions of compounds and, importantly, enable us to discover novel compound modes-of-action. Additionally, these tools provided the foundation for new investigations into the nature of chemical interactions with biological systems.

TP076: Succinct Colored de Bruijn Graphs
Date:Monday, July 11 4:10 pm - 4:30 pm
Room: Northern Hemisphere A1/A2
Topic: GENES / DATA
  • Martin Muggli, Colorado State University, United States
  • Alex Bowe, National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
  • Travis Gagie, Department of Computer Science, University of Helsinki, Finland
  • Robert Raymond, Colorado State University, United States
  • Noelle R. Noyes, Colorado State University, United States
  • Paul Morley, Colorado State University, United States
  • Keith Belk, Colorado State University, United States
  • Simon Puglisi, University of Helsinki, Finland
  • Christina Boucher, Colorado State University, United States

Area Session Chair: Pedja Radivojac

Presentation Overview:

MOTIVATION: Iqbal et al. (Nature Genetics, 2012) introduced the colored de Bruijn graph, a variant of the classic de Bruijn graph, which is aimed at "detecting and genotyping simple and complex genetic variants in an individual or population".
Because they are intended to be applied to massive population level data, it is essential that the graphs be represented efficiently.
Unfortunately, current succinct de Bruijn graph representations are not directly applicable to the colored de Bruijn graph, which requires additional information to be succinctly encoded as well as support for non-standard traversal operations.
RESULTS: Our data structure dramatically reduces the amount of memory required to store and use the colored de Bruijn graph, with some penalty to runtime, allowing it to be applied in much larger and more ambitious sequence projects than was previously possible. In particular, we use our method along with a custom curated database of antimicrobial resistance genes to track changes in the resistome across food production facilities. A short video of our work is available at http://cdbg.martindmuggli.com.
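
For readers unfamiliar with the underlying object, the sketch below builds a colored de Bruijn graph in the most direct way, as a plain dictionary from k-mers to the set of samples containing them. It is intentionally not succinct and says nothing about the authors' data structure; it only shows the extra color information and color-restricted traversal that a succinct representation has to support. Function names and sequences are hypothetical.

```python
def colored_de_bruijn(samples, k):
    """Map each k-mer edge to the set of sample labels ("colors") that contain it.

    samples: dict mapping a sample label to a list of sequences.
    Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its suffix.
    """
    colors = {}
    for label, sequences in samples.items():
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                colors.setdefault(seq[i:i + k], set()).add(label)
    return colors


def successors(colors, node, color):
    """Nodes reachable from `node` by one edge that carries `color`."""
    return sorted(kmer[1:] for kmer, labels in colors.items()
                  if kmer[:-1] == node and color in labels)


# Two toy samples that share a prefix and then diverge (a simple variant bubble).
graph = colored_de_bruijn({"s1": ["ACGT"], "s2": ["ACGA"]}, k=3)
next_in_s1 = successors(graph, "CG", "s1")
```

Storing one Python set per k-mer is what makes this naive version memory-hungry on population-scale data, which is the cost a succinct encoding aims to reduce.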

TP080: Interactome based drug discovery and disease-disease connections
Date:Tuesday, July 12 10:10 am - 10:30 am
Room: Northern Hemisphere A3/A4
Topic: PROTEINS / DISEASE
  • Gaurav Chopra, Purdue University, United States
  • Ram Samudrala, SUNY Buffalo, United States

Area Session Chair: Natasa Przulj

Presentation Overview:

We have developed a Computational Analysis of Novel Drug Opportunities (CANDO) platform (http://protinfo.org/cando/), funded by a 2010 NIH Director's Pioneer Award, that analyzes compound-proteome interaction signatures to determine drug behavior, in contrast to traditional single (or few) target approaches. Our platform implements a modeling pipeline that generates an interaction matrix between 3,733 human approved drugs and 48,278 proteins using a hierarchical chem- and bio-informatic fragment-based docking with dynamics protocol (~1 billion predicted interactions evaluated, considering multiple binding sites per protein). The platform then treats similarity of interaction signatures across all proteins as indicative of similar functional behavior, and nonsimilar signatures as indicative of off- and anti-target (side) effects, in effect inferring homology of compound/drug behavior at a proteomic level. The benchmarking accuracy using this approach to rank compounds for over 650 indications/diseases is ~36%, in contrast to accuracies of ~0.2% obtained when using scrambled control matrices. We prospectively validated “high value” predictions in in vitro and in vivo preclinical studies for more than a dozen indications, including type 1 diabetes, herpes, dental caries, dengue, tuberculosis, malaria, hepatitis B, and different cancers. Our drug prediction accuracy is ~35% across the nine indications, where 57/162 compounds validated thus far show comparable or better activity than an existing drug, or micromolar inhibition at the cellular level, and serve as novel repurposeable therapies. Taken together with the benchmarking accuracy and the effect of druggable protein classes on repurposing accuracy, our multitargeting results indicate that a large number of protein structures with diverse fold space and a specific polypharmacological interactome are necessary for accurate drug predictions using our proteomic and evolutionary drug discovery and repurposing platform.
Our approach is broadly applicable beyond repurposing, enables personalized and precision medicine, and foreshadows a new era of faster drug and target discovery using novel disease-disease connections.
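
The core idea of comparing compounds by their interaction signatures across many proteins can be sketched as a nearest-neighbor ranking. The example below assumes cosine similarity as the comparison measure, which the abstract does not specify; the compound names and three-protein signatures are invented for illustration and do not reflect the CANDO code or data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length score vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_signature(query, signatures):
    """Rank the other compounds by similarity of their proteome signature to `query`'s."""
    q = signatures[query]
    others = [(name, cosine(q, sig)) for name, sig in signatures.items() if name != query]
    # Most similar first; ties are broken by name for determinism.
    return sorted(others, key=lambda item: (-item[1], item[0]))


# Hypothetical interaction scores of three compounds against three proteins.
toy_signatures = {
    "drugA": [1.0, 0.0, 1.0],
    "drugB": [1.0, 0.0, 0.9],
    "drugC": [0.0, 1.0, 0.0],
}
neighbors = rank_by_signature("drugA", toy_signatures)
```

In a repurposing setting, compounds ranked closest to a drug approved for some indication would be the ones proposed as candidates for that same indication.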

TP081: Classifying Cancer Samples by microRNA Profiles: Read the Fine Print!
Date:Tuesday, July 12 10:10 am - 10:30 am
Room: Northern Hemisphere E1/E2
Topic: DISEASE / GENES
  • Roni Rasnic, The Hebrew University of Jerusalem, Israel
  • Michal Linial, The Hebrew University of Jerusalem, Israel
  • Nathan Linial, The Hebrew University of Jerusalem, Israel

Area Session Chair: Yves Moreau

Presentation Overview:

MicroRNAs (miRNAs) function primarily in regulating genes and maintaining cell homeostasis. Indeed, carcinogenesis is often represented by drastic perturbations in miRNA profiles. Many cancerous tissues share similar miRNA profiles with only a few dominating miRNAs. The Cancer Genome Atlas (TCGA) provides a rich resource with thousands of human samples covering >25 major cancer types. Here, we test the significance of miRNA information from TCGA in characterizing cancer tissues and distinguishing their types and tissue origin. We apply an SVM multiclass classifier to assess the separation power between cancer types and some of their healthy tissues. The ML approach was applied to 8522 samples associated with expression data for 1047 miRNAs. We find that the set of the lowest expressed miRNAs, which comprises only 0.003% of total miRNA reads, has a higher separation power. In fact, including the complementary set of highly expressed miRNAs deteriorates the classification success. We are able to improve the identification following a simple discretization of the data, improving the success from 56% with the naïve usage of the miRNA profiles to ~90%. We suggest using the separation capacity of the lowly expressed miRNAs for characterization of metastatic tumors with unknown tissue origin. Furthermore, we gain surprising and useful insights on classes that suffer a consistent failure in identification.

TP084: RNA sequencing-based cell proliferation analysis across 19 cancers identifies a subset of proliferation-informative cancers with a common survival signature
Date:Tuesday, July 12 10:30 am - 10:50 am
Room: Northern Hemisphere E1/E2
Topic: DISEASE
  • Brittany Lasseigne, HudsonAlpha Institute for Biotechnology, United States
  • Ryne Ramaker, HudsonAlpha Institute for Biotechnology and The University of Alabama at Birmingham, United States
  • Laura Palacio, HudsonAlpha Institute for Biotechnology, United States
  • David Gunther, HudsonAlpha Institute for Biotechnology, United States
  • Sara Cooper, HudsonAlpha Institute for Biotechnology, United States
  • Richard Myers, HudsonAlpha Institute for Biotechnology, United States

Area Session Chair: Yves Moreau

Presentation Overview:

Despite advances in cancer diagnosis and treatment strategies, it has been difficult to identify robust prognostic signatures in cancer. Cell proliferation has long been recognized as a potential prognostic marker in cancer, but has not been investigated across multiple cancers using tissue-based RNA sequencing. Here we explore the role of cell proliferation across 19 cancers (n=6,312 patients) from The Cancer Genome Atlas project by employing a ‘proliferative index’ derived from gene expression associated with PCNA expression. This proliferative index is significantly associated with patient survival (Cox, p-value<0.05) in 8/19 cancers, which we have defined as ‘proliferation-informative cancers’ (PICs). In PICs the proliferative index is strongly correlated with tumor stage and nodal invasion. Furthermore, PICs demonstrate lower proliferation machinery expression relative to other cancers (Spearman, p=1.76E-23). Transcriptome-wide predictive survival modeling using multivariate Cox regression with L1-penalized log partial likelihood (LASSO) for feature selection outperformed the ‘proliferative index’ in 18/19 cancers. Survival-associated expression patterns were relatively unique between cancers; however, PICs have a common survival signature of 86 genes (Cox, p<0.05 across all 8 cancers). Additionally, we find that proliferative index is significantly associated with somatic mutation burden (Spearman, p=1.76E-23). This study presents cancers for which cell proliferation may be an important prognostic marker and demonstrates that modern machine learning techniques can identify survival models more predictive than, and independent of, proliferative index for most cancers. We also present evidence for cell proliferation as a proxy for clinical parameters and confirm an association between cell proliferation and somatic mutation burden across cancers.

TP087: Data-Driven Analysis of Lymphocyte Infiltration in Breast Cancer Development and Progression
Date:Tuesday, July 12 10:50 am - 11:10 am
Room: Northern Hemisphere E1/E2
Topic: DISEASE
  • Ruth Dannenfelser, Princeton University, United States
  • Josie Ursini-Siegel, Lady Davis Institute for Medical Research, Canada
  • Vessela Kristensen, Radiumhospitalet, Norway
  • Olga Troyanskaya, Princeton University, United States

Area Session Chair: Yves Moreau

Presentation Overview:

The tumor microenvironment is now widely recognized for its role in tumor progression, treatment response, and clinical outcome. The intratumoral immunological landscape, in particular, has been shown to exert both pro-tumorigenic and anti-tumorigenic effects. Thus far, direct detailed studies of the cell composition of tumor infiltration have been limited, with some studies giving approximate quantifications using immunohistochemistry and other small studies obtaining detailed measurements by laboriously isolating cells from newly excised tumors and sorting them using flow cytometry. Herein we utilize a machine-learning-based approach to identify lymphocyte markers with which we can quantify the presence of B cells, cytotoxic T-lymphocytes, T-helper 1, and T-helper 2 cells in any gene expression data set, and apply it to studies of breast tissue. By leveraging many samples from existing large-scale studies, we are able to find an inherent cell heterogeneity in clinically characterized immune infiltrates, a strong link between estrogen receptor status and infiltration in normal and tumor tissues, and changes with genomic complexity, and to identify characteristic differences in lymphocyte expression among molecular groupings. Furthermore, we explore the effects that detailed infiltration patterns have on patient survival, and how they change with anti-estrogen therapy.

TP089: NUCLEOTIDE SEQUENCE COMPOSITION ADJACENT TO INTRONIC 5’ END IMPROVES TRANSLATION COSTS IN FUNGI
Date:Tuesday, July 12 11:40 am - 12:00 pm
Room: Northern Hemisphere A3/A4
Topic: SYSTEMS / GENES
  • Zohar Zafrir, Tel Aviv University, Israel
  • Tamir Tuller, Tel Aviv University, Department of Biomedical Engineering, Israel

Area Session Chair: Natasa Przulj

Presentation Overview:

It is generally believed that introns are not translated; therefore, the potential intronic sequence features that may be related to the translation step (occurring after splicing) have yet to be thoroughly studied. Focusing on four fungi as model organisms (S. cerevisiae, S. pombe, A. nidulans, and C. albicans), we performed a comprehensive large-scale systems biology study to characterize for the first time how translation is encoded in introns and affects their evolution. When considering the reading frame of exons upstream and adjacent to introns, we find evidence suggesting a preference for intronic STOP codons close to the intronic 5’ end, and that the beginning of introns is selected for codons with higher translation efficiency, presumably resulting in reduced translation and metabolic costs in cases of non-spliced introns. Ribosomal profiling data analysis in S. cerevisiae supports the conjecture that intron retention occurs frequently in this organism; thus, introns are partially translated, and their translation efficiency affects organismal fitness. We also show that this selection is stronger in highly translated and highly spliced genes, but is not restricted to genes with a specific function. Finally, we discuss the potential relation of the reported signals to an efficient nonsense-mediated decay (NMD) pathway due to splicing errors. These new discoveries, supported by population-genetics considerations, contribute to a broader understanding of intron evolution, and of how silent mutations affect gene expression and organismal fitness.

The talk is based on a paper that will be published (accepted) in the journal: DNA Research; I will also review very recent related studies (Zafrir & Tuller, RNA, 2015; Yofe* and Zafrir* et al., PLoS Genetics, 2014).
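The notion of an intronic STOP codon "in the reading frame of the upstream exon" can be made concrete with a small sketch. It assumes the upstream coding sequence is supplied starting at frame 0 (for example, from the start codon) and ending at the exon-intron junction; the function name and the toy sequences are hypothetical and are not taken from the study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_inframe_stop(exon_cds, intron):
    """Offset (relative to the intron 5' end) of the first in-frame stop codon.

    exon_cds is assumed to begin in frame 0 and end at the exon-intron junction.
    Returns None if the unspliced intron contains no in-frame stop codon.
    """
    # Nucleotides of an incomplete codon left over at the end of the exon.
    carry = len(exon_cds) % 3
    seq = (exon_cds[len(exon_cds) - carry:] + intron).upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            # Convert the index into intron coordinates.
            return i - carry
    return None


# Toy sequences (hypothetical): the exon ends on a codon boundary here.
offset = first_inframe_stop("ATGGCC", "GTAAGTTAACC")
```

A small offset means that a ribosome reading through a retained intron would terminate soon after the junction, which is the situation the abstract associates with reduced translation costs.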

TP090: Phenotype Stratification from the Electronic Health Record using Autoencoders
Date:Tuesday, July 12 11:40 am - 12:00 pm
Room: Northern Hemisphere E1/E2
Topic: DISEASE / DATA
  • Brett K Beaulieu-Jones, University of Pennsylvania, United States
  • Jason H Moore, University of Pennsylvania, United States
  • Casey S Greene, University of Pennsylvania, United States

Area Session Chair: Yves Moreau

Presentation Overview:

Genetic association studies and, on a larger scale, personalized medicine require highly specific and accurate phenotypes. Research-quality phenotyping is costly and can require manual clinician review. Electronic Health Records (EHRs) contain a wealth of phenotypic information but were built for clinical and billing purposes. Effectively extracting this information for research is challenging because many records are sparsely filled and labeled with billing codes. Here, we show the unsupervised use of autoencoders to model patients in the EHR. To evaluate model fit, we created a semi-supervised classifier by adding a random forest to the trained autoencoder. Semi-supervised denoising autoencoders showed classification improvements in simulation models, particularly when small numbers of patients have high-quality phenotypes. Deep autoencoders with dropout effectively imputed missing data in the PRO-ACT ALS clinical trial dataset, as measured by spike-in imputation accuracy. Deep-autoencoder-imputed data enabled more accurate prediction of ALS disease progression as defined by the ALS Functional Rating System. Finally, we show that despite symptomatic heterogeneity, ALS disease progression appears homogeneous, with time from onset being the most important predictor.
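Spike-in evaluation of imputation can be sketched as follows: hide a fraction of the values that are actually known, impute them, and compare the imputed values against the hidden truth. The sketch uses a simple column-mean imputer as a stand-in for the deep autoencoder described above; the function names, the 20% spike-in fraction, and the toy data are assumptions for illustration.

```python
import random


def mean_impute(rows):
    """Fill None entries with the column mean of the observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def spike_in_rmse(rows, impute, fraction=0.2, seed=0):
    """Hide a random fraction of known entries, impute, and return the RMSE."""
    rng = random.Random(seed)
    known = [(i, j) for i, r in enumerate(rows) for j, v in enumerate(r) if v is not None]
    hidden = rng.sample(known, max(1, int(len(known) * fraction)))
    masked = [list(r) for r in rows]
    for i, j in hidden:
        masked[i][j] = None  # spike in artificial missingness
    filled = impute(masked)
    squared = [(filled[i][j] - rows[i][j]) ** 2 for i, j in hidden]
    return (sum(squared) / len(squared)) ** 0.5


# Toy patient-by-feature matrix with one genuinely missing value.
data = [[1.0, 5.0], [2.0, None], [3.0, 7.0], [4.0, 8.0]]
error = spike_in_rmse(data, mean_impute)
```

Any imputer with the same interface, taking rows with `None` for missing values and returning completed rows, could be passed in place of `mean_impute`, which makes it possible to compare methods on identical spiked-in entries.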

TP095: Rapid Translation Initiation Prevents Mitochondrial Localization of mRNA
Date:Tuesday, July 12 12:20 pm - 12:40 pm
Room: Northern Hemisphere A3/A4
Topic: SYSTEMS / GENES
  • Thomas Poulsen, National Institute of Advanced Industrial Science and Technology (AIST), Japan
  • Kenichiro Imai, National Institute of Advanced Industrial Science and Technology (AIST), Japan
  • Martin Frith, National Institute of Advanced Industrial Science and Technology (AIST), Japan
  • Paul Horton, National Institute of Advanced Industrial Science and Technology (AIST), Japan

Area Session Chair: Natasa Przulj

Presentation Overview:

The mRNAs of some, but not all, nuclear-encoded mitochondrial proteins localize to the periphery of mitochondria. Previous studies have shown that both the nascent polypeptide chain and an mRNA binding protein play a role in this phenomenon, and have noted a positive correlation between mRNA length and mitochondrial localization. Here, we report the first investigation into the relationship between mRNA translation initiation rate and mRNA mitochondrial localization. Our results indicate that translation initiation promoting factors such as Kozak sequences are associated with cytosolic localization, while inhibiting factors such as 5' UTR secondary structure correlate with mitochondrial localization. Moreover, the frequencies of nucleotides in various positions of the 5' UTR show higher correlation with localization than those of the 3' UTR. These results suggest that rapid translation initiation may prevent mRNA mitochondrial localization. Interestingly, this may explain why short mRNAs, which are thought to initiate translation rapidly, seldom localize to mitochondria. We therefore propose a model in which translating mRNA has reduced mobility and tends not to reach the mitochondria. Finally, we explore this model with a simulation of mRNA diffusion using previously estimated translation initiation probabilities, and confirm that our model produces localization values similar to those measured in experimental studies.
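The proposed model can be illustrated with a toy simulation, which is not the authors' simulation. It assumes a one-dimensional random walk, treats an mRNA as fully immobile once translation initiates (a simplification of "reduced mobility"), and uses arbitrary values for the distance to the mitochondria, the number of steps, and the initiation probabilities.

```python
import random


def fraction_reaching_mitochondria(p_init, distance=3, max_steps=200,
                                   n_mrna=2000, seed=0):
    """Fraction of simulated mRNAs that reach the mitochondria before initiating.

    p_init is the per-step probability of translation initiation (assumed).
    """
    rng = random.Random(seed)
    reached = 0
    for _ in range(n_mrna):
        position = 0
        for _step in range(max_steps):
            if rng.random() < p_init:
                break  # translation initiated: molecule treated as immobile
            position += rng.choice((-1, 1))
            position = max(position, 0)  # reflecting boundary at the origin
            if position >= distance:
                reached += 1  # arrived at the mitochondrial periphery
                break
    return reached / n_mrna


slow_initiation = fraction_reaching_mitochondria(0.0)
fast_initiation = fraction_reaching_mitochondria(0.3)
```

In this toy setting, a higher initiation probability leaves fewer molecules able to diffuse as far as the mitochondria, which is the qualitative behavior the model predicts.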

TP108: Tracking the Evolution of 3D Gene Organization
Date:Tuesday, July 12 3:50 pm - 4:10 pm
Room: Northern Hemisphere A1/A2
Topic: GENES
  • Alon Diament, Tel Aviv University, Israel
  • Tamir Tuller, Tel Aviv University, Israel

Area Session Chair: Janet Kelso

Presentation Overview:

The study of eukaryotic genomic organization has been rapidly advancing in recent years, with next generation sequencing technologies, such as Hi-C, providing large scale measurements of 3D genomic organization at unprecedented resolution. It has recently been shown that the distribution of genes in eukaryotic genomes is not random and that their organization is strongly related to gene expression and function. It has also been shown that some level of conservation of this organization exists between organisms. However, almost all studies of 3D genomic organization analyzed each organism independently from others.

Here we propose a novel approach for inter-organismal analysis of the evolution of the 3D organization of genes, based on a network representation of Hi-C data from S. cerevisiae and S. pombe. We report global signals of conservation and re-organization of genes in the genome that are correlated with changes in their functionality and expression. Furthermore, we describe algorithms for identifying spatially co-evolving orthologous modules (SCOMs) and demonstrate them for various proposed types of modules, including: modules of co-localizing genes with conserved 3D positions; modules of genes that underwent significant changes in their 3D co-localization during evolution; and additional, more complex gene arrangements.

We show that this approach enables identifying biologically relevant modules of co-evolving genes with shared function. The approach is expected to contribute to the study of genome evolution, gene expression, and even tumorigenesis.
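One simple way to picture a cross-species comparison on a network representation of Hi-C data is to treat genes as nodes and 3D contacts as edges, then ask how many contacts between genes in one species are also contacts between their orthologs in the other. The sketch below is an assumption-laden illustration rather than the SCOM algorithm: the function name, the gene identifiers, and the one-to-one ortholog map are all hypothetical.

```python
def conserved_contact_fraction(contacts_a, contacts_b, orthologs):
    """Fraction of species-A contacts whose ortholog pair is a contact in species B.

    contacts_a, contacts_b: iterables of gene pairs (undirected edges).
    orthologs: dict mapping species-A genes to species-B genes (assumed 1:1).
    Contacts involving a gene without an ortholog are ignored.
    """
    edges_b = {frozenset(pair) for pair in contacts_b}
    mapped = []
    for g1, g2 in contacts_a:
        if g1 in orthologs and g2 in orthologs:
            mapped.append(frozenset((orthologs[g1], orthologs[g2])))
    if not mapped:
        return 0.0
    conserved = sum(1 for edge in mapped if edge in edges_b)
    return conserved / len(mapped)


# Toy networks with hypothetical gene identifiers.
contacts_a = [("a1", "a2"), ("a2", "a3"), ("a1", "a4")]
contacts_b = [("b2", "b1"), ("b1", "b3")]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3"}

fraction = conserved_contact_fraction(contacts_a, contacts_b, orthologs)
```

Storing edges as `frozenset` objects makes the comparison independent of the order in which a gene pair is listed.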