Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

#ISMB2016

Sponsors

Silver:
Bronze:
F1000
Recursion Pharmaceuticals

Copper:
Iowa State University

General and Travel Fellowship Sponsors:
Seven Bridges GBP GigaScience OverLeaf PLOS Computational Biology BioMed Central 3DS Biovia GenenTech HiTSeq IRB-Group Schrodinger TOMA Biosciences

Theme Presentation Schedule

Highlights, Late Breaking Research and Proceedings Track submissions are presented by scientific theme as part of the combined Theme Presentation schedule.
Presenters' names are in bold (for updates and changes, email steven@iscb.org)

Attention Conference Presenters - please review the Speaker Information Page available here.

(PT) - Adaptive local realignment via parameter advising
    Cancelled
Date: Sunday, July 10
Room: TBA
Theme: Sequence Analysis

    Mutation rates can vary across the residues of a protein, but when multiple sequence alignments are computed for protein sequences, typically the same choice of values for the substitution score and gap penalty parameters is used across the entire protein. We provide for the first time a new method called adaptive local realignment, which computes protein multiple sequence alignments that automatically use diverse alignment parameter settings in different regions of the input sequences. This allows the aligner’s parameter settings to locally adapt across a protein to more closely follow varying mutation rates.

    Our method builds on the Facet alignment accuracy estimator and on our prior work on global alignment parameter advising. For each region of a computed alignment that has low estimated accuracy, a collection of candidate realignments is generated using a set of alternate parameter choices. If one of these candidate realignments has higher estimated accuracy than the original subalignment, it replaces the original.

    Adaptive local realignment significantly improves the quality of alignments over using the single best default parameter choice. In particular, local realignment, when combined with existing methods for global parameter advising, boosts alignment accuracy by almost 24% over the best default parameter setting on the hardest-to-align benchmarks.

    A new version of the Opal multiple sequence aligner that incorporates adaptive local realignment, using Facet for parameter advising, is available free for non-commercial use at http://facet.cs.arizona.edu. This site also contains the benchmarks from our experiments, and optimal sets of parameter choices.
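The replace-if-better loop described in this abstract can be sketched as follows. This is a minimal illustration, not the Opal or Facet implementation: the `estimate` and `realign` callables are hypothetical stand-ins for an accuracy estimator and an aligner, the toy versions below exist only to make the sketch runnable, and the use of fixed-width windows to delimit regions is an assumption (the abstract does not say how regions are chosen).

```python
from typing import Callable, List, Sequence

Alignment = List[str]  # one gapped row per sequence, all rows the same length


def adaptive_local_realign(
    aln: Alignment,
    param_sets: Sequence[object],
    realign: Callable[[List[str], object], Alignment],
    estimate: Callable[[Alignment], float],
    window: int = 20,
    threshold: float = 0.5,
) -> Alignment:
    """Realign low-scoring windows, keeping a candidate only if it scores higher."""
    blocks = []
    ncols = len(aln[0])
    for start in range(0, ncols, window):
        sub = [row[start:start + window] for row in aln]
        best, best_score = sub, estimate(sub)
        if best_score < threshold:
            # Strip gaps so each candidate is rebuilt from the raw residues.
            residues = [row.replace("-", "") for row in sub]
            for params in param_sets:
                cand = realign(residues, params)
                score = estimate(cand)
                if score > best_score:
                    best, best_score = cand, score
        blocks.append(best)
    # Stitch the (possibly replaced) windows back together row by row.
    return ["".join(block[i] for block in blocks) for i in range(len(aln))]


def toy_estimate(sub: Alignment) -> float:
    """Hypothetical estimator: fraction of columns where all rows share a residue."""
    cols = list(zip(*sub))
    if not cols:
        return 0.0
    good = sum(1 for c in cols if len(set(c)) == 1 and c[0] != "-")
    return good / len(cols)


def toy_realign(residues: List[str], params: object) -> Alignment:
    """Hypothetical aligner: pad with gaps on the right ('left') or on the left."""
    width = max(len(r) for r in residues)
    if params == "left":
        return [r.ljust(width, "-") for r in residues]
    return [r.rjust(width, "-") for r in residues]
```

A real advisor would substitute an actual aligner and estimator for the two toy functions; the control flow (score, generate candidates, keep the best) is the only part this sketch is meant to convey.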

    002 - The Hague (PT) - ECCB 2016 Presentation
    Date: Monday, July 11th 08:45 am - 9:00 am
    Room: BCD
    Theme:

      003 - PhRMA (PT) - PhRMA Award Presentations
      Date: Monday, July 11th 08:45 am - 9:00 am
      Room: BCD
      Theme:

        AKES/AKES02_ISMB2016_Crus (PT) - Common Workflow Language
        Date: TBA
        Room: TBA
        Theme:

          AKES/AKES02_ISMB2016_Gaur (PT) - Enabling National-Scale Genomics on the Cloud
          Date: TBA
          Room: TBA
          Theme:

            AKES/AKES02_ISMB2016_OCon (PT) - Dockstore Tutorial
            Date: TBA
            Room: TBA
            Theme:

              AKES/AKES04_ISMB2016-Fre (PT) - Utilizing genomic data in clinical systems
              Date: TBA
              Room: TBA
              Theme:

                AKES/AKES04_ISMB2016-Frey (PT) - Collection of data for research
                Date: TBA
                Room: TBA
                Theme:

                  AKES/AKES04_ISMB2016-Mad (PT) - Practical Precision Medicine: Integration of clinical and genomic data to support cancer care
                  Date: TBA
                  Room: TBA
                  Theme:

                    AKES/AKES04_ISMB2016-Over (PT) - Increasing the reach of clinical genomics research and genomics-informed care
                    Date: TBA
                    Room: TBA
                    Theme:

                      AKES/AKES04_ISMB2016-Tene (PT) - Ethical, legal, and social implications of genomic testing
                      Date: TBA
                      Room: TBA
                      Theme:

                        AKES/AKES04_ISMB2016-Volc (PT) - Overview of bioinformatics techniques and uses in clinical practice and research
                        Date: TBA
                        Room: TBA
                        Theme:

                          Awards (PT) - ISCB Distinguished Fellows 2016
                          Date: Sunday, July 10
                          Room: BCD
                          Theme:

                            ClosingAwards (PT) - Awards & Closing
                            Date: Tuesday, July 12 5:40 pm - 6:00 pm
                            Room: BCD
                            Theme:

                              COSI 2 (PT) - STRING: Protein networks from data and text mining
                              Date: Sunday, July 10 10:30 am - 10:50 am
                              Room: BCD
                              Theme:

                                COSI 4 (PT) - Patient-Specific Network Data Fusion for Stratification, Biomarker Discovery and Personalizing Treatment
                                Date: Sunday, July 10 11:40 am - 12:00 pm
                                Room: BCD
                                Theme:

                                  COSI 5 (PT) - Whole-cell models: combining genomics and dynamical modeling
                                  Date: Sunday, July 10 12:00 pm - 12:20 pm
                                  Room: BCD
                                  Theme:

                                    KN03 (PT) - Using Single-Cell Transcriptome Sequencing to Infer Olfactory Stem Cell Fate Trajectories
                                    Date: Monday, July 11th 9:00 am - 10:00 am
                                    Room: BCD
                                    Theme:

                                      KN04 (PT) - Understanding Cellular Heterogeneity
                                      Date: Monday, July 11th 4:40 pm - 5:40 pm
                                      Room: BCD
                                      Theme:

                                        KN05 (PT) - Personalized Genomics and Computation
                                        Date: Tuesday, July 12 9:00 am - 10:00 am
                                        Room: BCD
                                        Theme:

                                          OP01 (PT) - Assessing the Differential Significance of Transcription Factors
                                          Date: Sunday, July 10 10:10 am - 11:10 am
                                          Room: America's Seminar
                                          Theme: Systems Biology and Networks

                                            While the general process of gene transcription is well understood, the mechanisms by which different genes are activated in different conditions or different cell types are not. Transcription must be precisely controlled for proper development and response to differing conditions, and determining exactly which part of the cellular machinery is responsible for changes in expression is an important task in biology. In order to determine exactly which transcription factors are responsible for very specific conditions, it can be helpful to examine which genes are differentially expressed in similar but slightly different conditions. Here, we consider the problem of taking two closely related differentially expressed gene sets and determining which transcription factors could be responsible for the differences. While identifying transcription factors whose targets are significantly enriched in a set of differentially expressed genes is a common computational task, here we address a subtly but importantly different question: which transcription factors' targets are more significantly overrepresented in one set than another. We present approaches to rank transcription factors based on their regulation of one set of genes as compared to another and apply them to gene expression sets associated with the Mediator complex, a complex essential for most transcription in eukaryotes which may play an important role in differential transcription. We apply our methods to investigate the regulatory differences between CDK8 and CDK19, homologous proteins that function similarly and can alternatively occupy the same position in Mediator. We show that our methods perform substantially better than naïve methods.
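To make the question concrete, here is a minimal sketch of one simple baseline for comparing transcription factor enrichment between two gene sets. It is an assumption for illustration only: the abstract does not specify its ranking methods or which naive methods it compares against. The sketch computes a hypergeometric enrichment p-value for each factor's targets in each set and ranks factors by the difference in log p-values. The function names are hypothetical, and both gene sets are assumed to be subsets of the background.

```python
from math import comb, log10
from typing import Dict, List, Set, Tuple


def hypergeom_sf(k: int, M: int, n: int, N: int) -> float:
    """P(X >= k) when N genes are drawn from M, of which n are targets."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def rank_tfs(
    tf_targets: Dict[str, Set[str]],
    set_a: Set[str],
    set_b: Set[str],
    background: Set[str],
) -> List[Tuple[str, float]]:
    """Rank TFs by how much more enriched their targets are in set_a than in set_b."""
    M = len(background)
    scored = []
    for tf, targets in tf_targets.items():
        t = targets & background
        p_a = hypergeom_sf(len(t & set_a), M, len(t), len(set_a))
        p_b = hypergeom_sf(len(t & set_b), M, len(t), len(set_b))
        # Positive score: stronger enrichment in set_a; negative: in set_b.
        scored.append((tf, log10(p_b) - log10(p_a)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For very large gene sets the p-values can underflow to zero in floating point, so a production version would work in log space throughout.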

                                            OP02 (PT) - Anti-aging and aging molecular networks database
                                                Cancelled
                                            Date: Sunday, July 10 10:10 am - 11:10 am
                                            Room: America's Seminar
                                            Theme: Systems Biology and Networks

                                              Motivation: Most of the existing aging-related databases provide fragmented information that is limited to individual molecules or genome expression profiles. However, actual aging mechanisms are the result of networks of interactions between numerous molecules, rather than simply being determined by individual genes/proteins. In order to fully understand such complex mechanisms, a comprehensive network of diseases, drugs, and molecules is required. Therefore, in this study we aimed to integrate several public databases in order to construct an anti-aging related meta-database.
                                              Results: A network database of approximately 40,000 edges was constructed from the literature and from publicly available databases. Open anti-aging related databases, protein interactions, drugs, biochemical data, diseases, and signaling pathways were retrieved from public databases. Other relevant information was extracted from NCBI articles using text mining. We constructed a user-friendly web server to represent molecular information from the anti-aging related network. Unlike previous molecule-centered databases, our server provides a collection of up-to-date research and results at the network level, which will aid searches for anti-aging related information and make it possible to discover new and significant insights into the mechanisms that underlie aging (http://antiaging2.labkm.net).

                                              OP03 (PT) - Fair Evaluation of Global Network Aligners
                                              Date: Sunday, July 10 10:10 am - 11:10 am
                                              Room: America's Seminar
                                              Theme: Systems Biology and Networks

                                                Analogous to genomic sequence alignment, biological network alignment identifies conserved regions between networks of different species. Functional knowledge can then be transferred from well-annotated to poorly annotated species between aligned network regions. Network alignment typically encompasses two algorithmic components: a node cost function (NCF), which measures similarities between nodes in different networks, and an alignment strategy (AS), which uses these similarities to rapidly identify high-scoring alignments. Different methods use both different NCFs and different ASs. Thus, it is unclear whether the superiority of a method comes from its NCF, its AS, or both. We previously showed, using MI-GRAAL and IsoRankN (state-of-the-art methods at the time), that combining the NCF of one method with the AS of another can yield a new, superior method. More recently, we further confirmed this by mixing and matching MI-GRAAL’s and GHOST’s NCFs and ASs. Most recently, we introduced a novel AS called Weighted Alignment VotEr (WAVE). When used on top of well-established NCFs of the existing methods (such as MI-GRAAL or GHOST), WAVE improves alignment quality compared to the existing methods.
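The separation of node cost function and alignment strategy can be illustrated with a small sketch in which the strategy takes the cost function as a parameter, so the two can be mixed and matched. Everything here is a toy assumption: the degree-based similarity and the greedy matching are placeholders chosen for brevity, not the NCFs or ASs of MI-GRAAL, IsoRankN, GHOST, or WAVE, and edge correctness is used only as a simple example of an alignment quality measure.

```python
from typing import Callable, Dict, Set

Graph = Dict[str, Set[str]]  # undirected adjacency sets
NCF = Callable[[Graph, Graph, str, str], float]


def degree_ncf(g1: Graph, g2: Graph, u: str, v: str) -> float:
    """Toy node cost function: similarity of node degrees, in [0, 1]."""
    du, dv = len(g1[u]), len(g2[v])
    return 1.0 - abs(du - dv) / max(du, dv, 1)


def greedy_as(g1: Graph, g2: Graph, ncf: NCF) -> Dict[str, str]:
    """Toy alignment strategy: repeatedly take the most similar unused pair."""
    pairs = sorted(
        ((ncf(g1, g2, u, v), u, v) for u in g1 for v in g2),
        key=lambda t: (-t[0], t[1], t[2]),  # ties broken by node name
    )
    mapping: Dict[str, str] = {}
    used: Set[str] = set()
    for _, u, v in pairs:
        if u not in mapping and v not in used:
            mapping[u] = v
            used.add(v)
    return mapping


def edge_correctness(g1: Graph, g2: Graph, mapping: Dict[str, str]) -> float:
    """Fraction of g1 edges whose endpoints map to an edge in g2."""
    edges = {(u, w) for u in g1 for w in g1[u] if u < w}
    if not edges:
        return 0.0
    kept = sum(
        1 for u, w in edges if mapping.get(w) in g2.get(mapping.get(u), set())
    )
    return kept / len(edges)
```

Because `greedy_as` only sees the `ncf` callable, swapping in a different similarity function or a different strategy changes one component while holding the other fixed, which is the kind of controlled comparison the abstract argues for.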

                                                OP04 (PT) - Estimation of ribosome profiling performance and reproducibility at various levels of resolution
                                                Date: Sunday, July 10 10:10 am - 11:10 am
                                                Room: America's Seminar
                                                Theme: Systems Biology and Networks

                                                  Ribosome profiling (or Ribo-seq) is currently the most popular methodology for studying translation; it has been employed in recent years to decipher various fundamental gene expression regulation aspects.

                                                  The main promise of the approach is its ability to detect ribosome densities over an entire transcriptome in high resolution of single codons. Indeed, dozens of ribo-seq studies have included results related to local ribosome densities in different parts of the transcript; nevertheless, the performance of ribo-seq has yet to be quantitatively evaluated and reported in a large-scale multi-organismal and multi-protocol study of currently available datasets.

                                                  Here we provide the first objective evaluation of Ribo-seq at single-nucleotide resolution using clear, interpretable measures, based on the analysis of 15 experiments, 6 organisms, and a total of 712,168 transcripts. Our major conclusion is that the ability to infer signals of ribosomal densities at nucleotide scale is considerably lower than previously thought, as signals at this level are not reproduced well in experimental replicates. In addition, we provide various quantitative measures that connect the expected error rate with Ribo-seq analysis resolution.
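One way to see how reproducibility can depend on resolution is to correlate per-position read counts from two replicates after summing them into bins of increasing width. The sketch below is an illustrative assumption, not the measures used in the study (which the abstract does not define): it uses Pearson correlation and fixed-width bins, and the function names are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def bin_counts(counts: Sequence[float], width: int) -> List[float]:
    """Sum per-nucleotide counts into consecutive bins of the given width."""
    return [sum(counts[i:i + width]) for i in range(0, len(counts), width)]


def reproducibility_by_resolution(
    rep1: Sequence[float], rep2: Sequence[float], widths: Sequence[int] = (1, 3, 30)
) -> Dict[int, float]:
    """Correlation between two replicates at each bin width (1 = nucleotide, 3 = codon)."""
    return {w: pearson(bin_counts(rep1, w), bin_counts(rep2, w)) for w in widths}
```

If two replicates agree on how many reads fall in each codon but not on the exact nucleotide, the correlation at width 3 will be high while the correlation at width 1 is low, which is the pattern the abstract's conclusion describes.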

                                                  OP05 (PT) - Urothelial cancer cell line models of tumor biology and drug response
                                                  Date: Sunday, July 10 10:10 am - 11:10 am
                                                  Room: America's Seminar
                                                  Theme: Bioinformatics of Disease and Treatment

                                                    The utility of tumor-derived cell lines depends on their ability to recapitulate the underlying genomic aberrations found in primary tumors. Here, we analyze the exome sequences of 25 bladder cancer (BCa) cell lines and compare mutations, copy number alterations, gene expression and drug response to BCa patient samples in The Cancer Genome Atlas (TCGA). We show that the genomic aberrations found in BCa cell lines mimic patient samples, including similar mutation patterns associated with altered CpGs and APOBEC-family cytosine deaminases, activating mutations in the TERT promoter, mutations in known BCa-associated genes (TP53, RB1, CDKN2A and TSC1), and alterations in chromatin-associated proteins (MLL3, ARID1A, CHD6 and KDM6A). We confirmed non-silent sequence alterations in 76 cancer-associated genes. Next, we used PARADIGM to infer pathway activities for cisplatin-treated BCa cell lines based on the cell lines’ gene expression and copy number data. We used the inferred pathway activities to build a predictive model of platinum drug response. The predictive model was based on an elastic net regression, which provided an implicit feature selection that identified important pathway concepts relevant to cisplatin response. When applied to BCa patients gathered from TCGA, the model predicted overall response, showing a clear separation in survival of predicted nonresponders vs predicted responders in the platinum-treated patient cohort (p=0.05) and no separation in the untreated patient cohort (p=0.62). Together, these data and predictive models represent a valuable community resource to model basic tumor biology and to study the pharmacogenomics of BCa.
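The implicit feature selection mentioned above comes from the L1 part of the elastic net penalty, which drives some coefficients exactly to zero. The following is a minimal coordinate-descent sketch of generic elastic net regression on toy data. It does not involve PARADIGM or the authors' model; the objective parameterization, (1/2n)||y - Xb||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||^2), is an assumed convention, and the inputs are assumed to be centered so that no intercept is needed.

```python
from typing import List, Sequence


def soft_threshold(z: float, g: float) -> float:
    """Shrink z toward zero by g, returning exactly 0 inside [-g, g]."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def elastic_net(
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    lam: float,
    alpha: float,
    n_iter: int = 200,
) -> List[float]:
    """Cyclic coordinate descent for elastic net (centered data, no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, kept up to date
    z = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0:
                continue  # constant-zero feature carries no information
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam * alpha) / (z[j] + lam * (1 - alpha))
            delta = new - beta[j]
            if delta != 0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta
```

Features whose correlation with the residual stays below `lam * alpha` receive a coefficient of exactly zero, which is how the penalty selects a subset of features.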

                                                    OP06 (PT) - The Landscape of Circular RNA in Cancer
                                                    Date: Sunday, July 10 10:10 am - 11:10 am
                                                    Room: America's Seminar
                                                    Theme: Sequence Analysis

                                                      Circular RNAs (circRNAs) are a new class of abundant, non-adenylated, and stable RNAs that form a covalently closed loop. Recent studies have suggested that circRNAs play important regulatory roles through interactions with miRNAs and ribonucleoproteins. High-throughput RNA sequencing to detect circRNAs requires non-poly(A)-selected protocols. In this study, we established the use of an Exome Capture RNA-Seq protocol to profile circRNAs across more than 1000 human cancer samples. We validated our protocol against two other gold-standard methods, depletion of rRNA (Ribo-Zero) and digestion of linear transcripts (RNase-R). Capture RNA-seq was shown to greatly facilitate the high-throughput profiling of circRNAs, providing the most comprehensive catalogue of circRNA species to date. Specifically, our method achieved significantly better enrichment for circRNAs than rRNA depletion and, unlike RNase-R treatment, preserved accurate circular-to-linear ratios. Although the correlation between circular and linear isoform abundance was modest in general, we found strong evidence that the lineage specificity of circular RNAs is due to the lineage specificity of their parent genes. To shed light on the mechanism of circRNA biogenesis, we are investigating the associations of mutations in canonical splicing sites and splicing factors with aberrant formation of circRNAs. Finally, the ratio of circular to linear transcript abundance was explored to give insight into the dynamics between transcriptome stability/turnover and cell proliferation. Overall, our compendium provides a comprehensive resource that could aid the exploration of circRNAs as a new type of biomarker, or as intriguing splicing and regulatory phenomena.

                                                      OP07 (PT) - Defining the genetic overlap of patient tumors and patient derived xenograft models in colorectal cancer
                                                      Date: Sunday, July 10 10:10 am - 11:10 am
                                                      Room: America's Seminar
                                                      Theme: Bioinformatics of Disease and Treatment

                                                        Introduction

                                                        Colorectal cancer is one of the most common forms of cancer and is the second leading cause of cancer deaths in the world. While patient-derived xenograft models have emerged as an important tool to study tumor growth, progression and response to therapy, the extent to which they recapitulate the genetic features of the primary tumors is unknown. In this study, we compare colorectal patient tumors and patient-derived tumor xenografts (PDX) obtained from the same patient to identify recurrently mutated genes and their overlap in colorectal cancer.

                                                        Method
                                                        We generated patient-derived xenografts from 8 different colorectal cancer tumors. We sequenced exomes of these tumors, paired germline DNA and PDXs to identify somatic and germline mutations from 8 patients with colorectal cancer. We compared somatic mutations along with copy number alterations between the tumors and PDXs. We further applied copy number analyses along with somatic allele frequencies to infer tumor purity. The integration of allelic fraction and copy number information also helped us to identify tumor subpopulations.

                                                        Results
                                                        We identified significant recurrent mutations in the PI3K pathway gene PIK3CA, the ERBB-RAS pathway gene NRAS and the Wnt pathway genes TCF7L2 and APC. We found hotspot mutations in the tumor suppressor gene TP53 and the transcriptional modifier gene SMAD2 in both the patients and the PDXs. We observed significant subclonal heterogeneity in genes frequently mutated in colorectal cancer, both in patient tumors and in PDXs.
                                                        Our study demonstrates that tumor-specific PDX models faithfully recapitulate the genetic heterogeneity and clonality in tumors and are viable models for targeted therapies.

                                                        OP08 (PT) - Visualization of TCGA RNA-Seq information with an user-friendly mobile application
                                                        Date: Sunday, July 10 10:10 am - 11:10 am
                                                        Room: America's Seminar
                                                        Theme: Sequence Analysis

                                                          In recent years, the rapid expansion of mobile devices, including smartphones and tablets, has created a new trend in personal computing. Personal mobile devices have become convenient tools for everyday information retrieval and exchange. However, few mobile applications (apps) have been created to retrieve and display genome annotation information on tablets or smartphones, and currently no bioinformatics-related mobile application has been developed specifically for visualizing large-scale NGS sequence data. With the increasing computation and graphics capacities of mobile devices, mobile applications could become suitable, user-friendly platforms for interrogating large-scale bioinformatic and genomic data. Here, we developed mobile application software to demonstrate the feasibility of visualizing large-scale human cancer gene expression information. We have implemented an iOS mobile application (RNA-Seq Viewer) to visualize next-generation sequencing gene expression information from over 2,500 human cancer patients retrieved from The Cancer Genome Atlas (TCGA). Users can select RNA-Seq data of any individual sample from nine different cancer types, and the application efficiently displays whole-transcriptome expression information over a human chromosome framework, with easy accessibility and an intuitive navigation interface. Local gene modulation patterns can be inspected thoroughly. In addition, users can visualize their own RNA-Seq data by building a customized dataset. We imagine that such mobile applications could be used in future personalized medicine applications, serving as an underlying component for easy access to genomic and medical information through cloud infrastructure on various mobile devices.

                                                          OP09 (PT) - A bioinformatics tool to improve the efficacy of exon skipping therapy for Duchenne muscular dystrophy
                                                          Date: Sunday, July 10 10:10 am - 11:10 am
                                                          Room: America's Seminar
                                                          Theme: Bioinformatics of Disease and Treatment

                                                            Duchenne muscular dystrophy (DMD) is a common and devastating genetic disease characterized by muscle wasting. Exon skipping uses small DNA-like molecules, antisense oligos (AOs), that act like stitches to modulate gene products and rescue the mutations. The efficacy of exon skipping at different target positions can vary by more than 20-fold, so the selection of the target site could make the difference between success and failure of clinical trials. However, no effective method has been developed to choose the optimal target site. We propose to develop an in silico (computational) method, which is considered a fast, inexpensive, and effective way to guide the screening. We have recently developed such a framework and identified a "DNA-stitch" that is improved by more than 10 times compared to current clinical trial molecules. We wish to improve it further and identify new drug candidates that can treat a majority of DMD patients with various mutations.
                                                            We plan to pursue the following objectives: 1) to identify influential features in exon skipping, and use bioinformatics techniques to develop an efficient algorithm to predict the efficacy of exon skipping of AOs; 2) to improve the efficacy of both single- and multi-exon skipping, and extend our framework to predict the efficacy of multiple AOs, using a new algorithm that addresses the interaction of random sets of oligos and RNAs; 3) to verify the correlation of predicted and actual efficacy of exon skipping in vitro and in vivo; 4) to launch the web software and incorporate community feedback to improve its quality.

                                                            OP10 (PT) - Epigenetic age predicted from H3K9ac ChIP-seq data is associated with Alzheimer's disease pathologies in the human prefrontal cortex
                                                            Date: Sunday, July 10 10:10 am - 11:10 am
                                                            Room: America's Seminar
                                                            Theme: Bioinformatics of Disease and Treatment

                                                              Alzheimer's disease (AD) is a common neurodegenerative disease, and age is a known major risk factor for AD. We analyzed the epigenetic mark histone 3 lysine 9 acetylation (H3K9ac) in the human prefrontal cortex of 676 samples from the ROSMAP study. Participants were not cognitively impaired upon study entry. After death, AD pathologies including neurofibrillary tangles were measured and anti-H3K9ac ChIP-seq experiments were conducted. We identified 26,384 H3K9ac domains in the ChIP-seq data. The number of sequence reads falling into each domain was determined for each sample and normalized by regressing out technical nuisance variables.

                                                              We split the dataset into training (n=446) and test data (n=230). An L1-penalized regression model was fitted on the training data with age of death as outcome and H3K9ac domains as penalized explanatory variables. Gender was added as an unpenalized covariate. The penalty parameter was determined by maximizing the cross-validated likelihood on the training set. The coefficients of 10 domains were nonzero. This model was used to predict the epigenetic age of the test samples. Predicted epigenetic age showed a moderate correlation of 0.25 with age of death. We defined accelerated aging as the residuals resulting from regressing epigenetic age on age of death and gender. Accelerated aging was positively associated with neurofibrillary tangles (p=0.022).

                                                              We further discuss accelerated aging in AD and limitations of our study. We also calculate accelerated aging based on DNA methylation from the same samples [Levine et al., 2015] and compare those estimations to the H3K9ac-derived estimations.
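The residual-based definition of accelerated aging given in this abstract can be sketched with ordinary least squares. This is a minimal illustration on hypothetical inputs, not the authors' analysis code: the function names are invented, gender is assumed to be coded 0/1, and the small linear solver assumes the covariates are not collinear (for example, both genders must be present).

```python
from typing import List, Sequence


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                for k in range(c, n + 1):
                    M[r][k] -= f * M[c][k]
    return [M[i][n] / M[i][i] for i in range(n)]


def age_acceleration(
    epigenetic_age: Sequence[float],
    age_at_death: Sequence[float],
    gender: Sequence[float],
) -> List[float]:
    """Residuals from regressing epigenetic age on age at death and gender."""
    X = [[1.0, a, g] for a, g in zip(age_at_death, gender)]  # intercept first
    p = 3
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * y for r, y in zip(X, epigenetic_age)) for i in range(p)]
    beta = solve(XtX, Xty)
    return [y - sum(bk * xk for bk, xk in zip(beta, r)) for r, y in zip(X, epigenetic_age)]
```

A positive residual marks a sample whose predicted epigenetic age is higher than expected for its chronological age and gender; these residuals are what would then be tested for association with pathology.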

                                                              OP11 (PT) - ZIKV-CDB: A collaborative database to help understanding symptoms induced by ZIKA virus infection mediated by small noncoding RNAs
                                                              Date: Sunday, July 10 11:40 am - 12:40 pm
                                                              Room: America's Seminar
                                                              Theme: Bioinformatics of Disease and Treatment

                                                                Zika virus (ZIKV) is an emerging mosquito-borne flavivirus, first isolated in 1947 from the serum of a pyrexial rhesus monkey caged in the Zika Forest (Uganda/Africa). In 2007 ZIKV was reported to be responsible for an outbreak of relatively mild disease on Yap Island in the western Pacific Ocean. In the past year, ZIKV has been circulating in the Americas, probably introduced through Easter Island (Chile) by French Polynesians. In early 2015, a new outbreak was recognized in northeast Brazil, where concerns over its possible links with infant microcephaly have been discussed. Providing a definitive link between ZIKV infection and birth defects is still a big challenge. Small noncoding RNAs (small ncRNAs) play important roles in biological processes, mainly regulating post-transcriptional gene expression through mechanisms of translation repression and gene silencing. It is well known that some classes of small ncRNA are able to influence viral pathogenesis and brain development. The potential for flavivirus-mediated small ncRNA signaling dysfunction in brain-tissue development provides a compelling candidate mechanism underlying the perceived link between ZIKV and microcephaly. A collaborative database called ZIKV-CDB has been assembled that could help target mechanistic investigations of this possible relationship between ZIKV symptoms and small ncRNA-mediated control of human gene expression, helping to identify potential targets for therapy. The database is under development, but already includes predicted miRNAs involved in ZIKV/human-host interaction and is available at http://zikadb.cpqrr.fiocruz.br.

                                                                OP12 (PT) - Metabolome wide association of DDT exposure in humans and in mice
                                                                Date: Sunday, July 10 11:40 am - 12:40 pm
                                                                Room: America's Seminar
                                                                Theme: Systems Biology and Networks

                                                                  Presentation Overview: Show

                                                                  Environmental exposures contribute greatly to human health and disease, yet it has been difficult to quantify such impacts. Research on the exposome and metabolomics, driven by high-resolution mass spectrometry, is now moving this frontier forward. The exposome aims to catalog internal doses or surrogates of all environmental exposures. The metabolome captures all small molecules, reflecting the biochemical state that serves as deep phenotyping and as the footprint of gene activities. These new data thus become the missing pillar in understanding gene-environment interactions. To illustrate this emerging paradigm, we use high-resolution metabolomics to study the effect of exposure to the pesticide DDT (dichlorodiphenyltrichloroethane) in a human population and in mouse models.
                                                                  Archived serum samples of 465 subjects in California from the 1960s, when DDT exposure was at its peak, were used for metabolomics analysis, using a Thermo Q-Exactive mass spectrometer coupled with reverse phase C18 liquid chromatography. The association of each metabolite feature with DDT was assessed by regression models, accounting for age, BMI and total blood lipids. This metabolome wide association study (MWAS) was followed by mummichog, our published algorithm for untargeted metabolomics, to perform metabolic pathway and network analysis. A similar analysis in mouse models confirmed the significant pathways detected in the human population, including the metabolism of arginine, aspartate, asparagine and fatty acids. This study demonstrates a new methodology for MWAS and reveals biological effects of DDT exposure.

                                                                  OP13 (PT) - Multiscale, multifactorial response network of immunization in humans
                                                                  Date: Sunday, July 10 11:40 am - 12:40 pm
                                                                  Room: America's Seminar
                                                                  Theme: Systems Biology and Networks

                                                                    Presentation Overview: Show

                                                                    High-dimensional data are an important part of the tremendous recent growth of human immunology, which to a great extent benefits from controlled longitudinal vaccination studies. We report here our development of a Multiscale, Multifactorial Response Network (MMRN) using data from a herpes zoster vaccine study in humans. Metabolomics, transcriptomics, cytokines and frequencies of cell subpopulations were measured multiple times at the beginning of the study, and antibody response was monitored up to 6 months. Dimension reduction was performed in two steps. For example, the transcriptome was collapsed into our previously published blood transcription modules, and the modules were further grouped by network clustering techniques. Partial least squares regression was used to assess the association between different data types, with significance evaluated by a permutation test. The resulting MMRN revealed important temporal connections between cytokines, plasma metabolites, blood cell frequencies and gene expression. We demonstrate that the MMRN is highly accurate in predicting biological outcomes. These results also suggest a new paradigm in which gene expression in blood cells is guided by metabolite cues from the plasma.
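
The abstract does not describe how its permutation test was set up. As a minimal sketch of the general idea only, the example below swaps in a simpler statistic (absolute Pearson correlation instead of partial least squares regression) and uses hypothetical function names and toy numbers. It estimates how often a shuffled pairing of two measurements is at least as strongly associated as the observed pairing.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the association between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)


# Toy values, not data from the study.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
p = permutation_p_value(x, y)
```

The same shuffling scheme applies to any association statistic, so the correlation call could in principle be replaced by a regression-based score.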

                                                                    OP14 (PT) - Widespread misannotation of samples in genomics studies
                                                                    Date: Sunday, July 10 11:40 am - 12:40 pm
                                                                    Room: America's Seminar
                                                                    Theme: Bioinformatics of Disease and Treatment

                                                                      Presentation Overview: Show

                                                                      Concern about the reproducibility and reliability of biomedical research has been rising. A bedrock principle of research conduct is that the samples analyzed are correctly identified and not mixed up during processing, but this has rarely been assessed formally.
                                                                      Here we studied the prevalence of sample misannotation in a large corpus of genomics studies by comparing meta-data annotations of sex to predictions from expression of sex-specific genes. We identified apparently misannotated samples in 46% of the datasets sampled. Extrapolating beyond our corpus, we estimate that at least 33% of all studies have at least one such mix-up (99% confidence interval). Because this method can only identify a subclass of potential misannotations, this provides a conservative estimate for the breadth of the problem. In an additional set of studies that used samples from the same subjects, 2/4 had misannotated samples. These misannotations are likely to result from laboratory mix-ups rather than subject meta-data collection errors.
                                                                      Our findings emphasize the need for genomics researchers to implement more stringent sample tracking and data quality control steps, and suggest that re-use of published data should be done in conjunction with careful re-examination of meta-data.
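
The comparison of annotated sex with expression-based predictions can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the abstract names no marker genes, so XIST (female-specific) and RPS4Y1 (male-specific) are assumed here, the decision rule is a simple comparison of the two expression values, and all sample values are invented.

```python
def predict_sex(expression, female_marker="XIST", male_marker="RPS4Y1"):
    """Predict sex from the relative expression of two assumed marker genes."""
    return "F" if expression[female_marker] > expression[male_marker] else "M"


def flag_mismatches(samples):
    """Return ids of samples whose annotated sex disagrees with the prediction."""
    return [
        sample["id"]
        for sample in samples
        if predict_sex(sample["expression"]) != sample["annotated_sex"]
    ]


# Toy values, not data from the study.
samples = [
    {"id": "s1", "annotated_sex": "F", "expression": {"XIST": 9.1, "RPS4Y1": 2.0}},
    {"id": "s2", "annotated_sex": "M", "expression": {"XIST": 8.7, "RPS4Y1": 2.3}},
    {"id": "s3", "annotated_sex": "M", "expression": {"XIST": 1.9, "RPS4Y1": 10.4}},
]
flagged = flag_mismatches(samples)
```

Here sample s2 is annotated as male but shows a female-like expression pattern, so it would be flagged for review.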

                                                                      OP15 (PT) - Sample alignment: A critical QC step for integrative analysis using multi-omics data
                                                                      Date: Sunday, July 10 11:40 am - 12:40 pm
                                                                      Room: America's Seminar
                                                                      Theme: Bioinformatics of Disease and Treatment

                                                                        Presentation Overview: Show

                                                                        Biological systems employ multiple levels of regulation that enable them to respond to genetic, epigenetic, genomic, and environmental perturbations. Advances in high-throughput technologies have generated comprehensive datasets measuring multiple aspects of biological regulations. Public databases, such as TCGA (The Cancer Genome Atlas), have been created for depositing diverse types of omics data for public dissemination. However, sample errors, such as sample-swapping or mis-labeling, are inevitable during the process of data generation and management. Because data errors could lead to wrong scientific conclusions, it is critical to properly match different types of omics data pertaining to the same individual before applying integrative analysis.
                                                                        We applied a systematic alignment method to TCGA datasets. For example, in the breast cancer dataset (BRCA) consisting of ~1000 samples, we detected multiple sample errors in different types of molecular data. In each type of data, about 3-8% of profiles were not consistent with the labels based on the sample barcodes (16 profiles in microarray, 4 in HM27, 18 in HM450, 9 in GAmiRNA, 84 in HiSeq-miRNA, 31 in CNV). Multi-omics alignments identified sample-swapping of the 16 samples in microarray and mis-labeling of the 8 miRNA samples. Errors in gender or in sample labeling were also observed in other cancer datasets in TCGA (such as glioblastoma, lung, prostate, stomach). These results suggest that sample errors are not a dataset-specific problem but a more global problem in public databases, and that our approach can therefore provide a critical QC step to clean data for integrative analysis using large-scale datasets.

                                                                        OP16 (PT) - Knowledge-guided fuzzy logic network modeling to infer cancer signaling pathways from time-series data
                                                                        Date: Sunday, July 10 11:40 am - 12:40 pm
                                                                        Room: America's Seminar
                                                                        Theme: Systems Biology and Networks

                                                                          Presentation Overview: Show

                                                                          Computational modeling of signaling pathways is crucial for understanding carcinogenesis and predicting responses of cancer cells to drug treatments. However, canonical signaling pathways curated from the literature are seldom context-specific and thus can hardly support precise predictions of anti-cancer drug effects. Association-based data-driven methods have drawbacks such as limited interpretability about underlying mechanisms. Therefore, hybrid methods that integrate prior knowledge and real data for network inference are highly desirable. In this paper, we propose a knowledge-guided fuzzy logic network model to infer signaling pathways by exploiting both prior knowledge and time-series data. Dynamic time warping is adopted to measure the goodness of fit between experimental and predicted data, so that our method can model temporally-ordered experimental observations. Moreover, two regularizers are introduced to penalize the incompatibility of the model with prior knowledge and constrain the number of proteins interacting with each signaling protein. The knowledge-guided fuzzy logic network model is further converted to a constrained nonlinear integer programming problem that can be solved by a genetic algorithm. We evaluated the proposed method on a synthetic dataset and a real time-series phosphoproteomics dataset. The experimental results demonstrate that our model can effectively uncover drug-induced alterations in signaling pathways in cancer cells. Compared with existing hybrid models, we are able to model feedback loops so that the dynamical mechanisms of signaling networks can be uncovered from time-series data. By calibrating generic models of signaling pathways against real data, our method supports precise predictions of context-specific anticancer drug effects.
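
Dynamic time warping, which the abstract uses as its goodness-of-fit measure, can be sketched in a few lines. This is the textbook formulation with an absolute-difference local cost; the function name is hypothetical, and the abstract does not say which variant or local cost the authors used.

```python
def dtw_distance(observed, predicted):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(observed), len(predicted)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning observed[:i] with predicted[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(observed[i - 1] - predicted[j - 1])
            # Extend the cheapest of: advance observed only, predicted only, or both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

Because one point may align with several points in the other series, a time course that is merely stretched in time (the first call) incurs no penalty, which is what makes the measure suitable for temporally ordered observations sampled at different rates.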

                                                                          OP17 (PT) - DECONVOLUTION OF CELL AND ENVIRONMENT SPECIFIC SIGNALS AND THEIR INTERACTIONS FROM COMPLEX MIXTURES IN BIOLOGICAL SAMPLES
                                                                          Date: Sunday, July 10 11:40 am - 12:40 pm
                                                                          Room: America's Seminar
                                                                          Theme: Systems Biology and Networks

                                                                            Presentation Overview: Show

                                                                            Background:
                                                                            In many fields of science, observations on a studied system represent complex mixtures of signals of various origins. Tumors are engulfed in a complex tumor microenvironment (TME) that critically impacts progression and response to therapy. It includes tumor cells, fibroblasts, and a diversity of immune cells. It is known that, under some assumptions, complex signal mixtures can be separated using classical and advanced methods of source separation and dimension reduction.

                                                                            Description:
                                                                            In this work, we apply independent component analysis (ICA) to decipher the sources of signals shaping the transcriptomes (global quantitative profiles of mRNA molecules) of tumor samples, with a particular focus on immune system-related signals. We apply ICA iteratively, decomposing signals into sub-signals that can be interpreted using pre-existing immune signatures through correlation or enrichment analysis.

                                                                            Results:
                                                                            Our analysis showed that signals related to groups of immune cell types can be identified with an unsupervised learning approach in a breast cancer dataset. Using Fisher's exact test, we identified significant groups corresponding to three out of five sub-signals: (1) T cells, (2) DC/macrophages, (3) monocytes/macrophages/eosinophils/neutrophils. The T-cell metagene correlates well with tumor grade (Kruskal-Wallis test p-value = 0.003).

                                                                            Discussion:
                                                                            Ongoing analysis aims to evaluate the robustness of the represented groups and possible differences between several types of cancer. We aim to characterize the degree of immune infiltration in the cancer transcriptome dataset and to correlate it with patients’ survival and tumor characteristics. If successful, the results will be used in cancer diagnosis and therapy, especially immunotherapies.

                                                                            OP18 (PT) - Comprehensive profiling of somatic mosaicism in the human brain
                                                                            Date: Sunday, July 10 11:40 am - 12:40 pm
                                                                            Room: America's Seminar
                                                                            Theme: Genome Organization and Annotation

                                                                              Presentation Overview: Show

                                                                              As mounting evidence indicates, each cell in the human body has its own genome, a phenomenon called somatic mosaicism. Such somatic variations include single nucleotide variants (SNVs), small insertions and deletions (indels), transposable element insertions, large copy-number variations (CNVs), and structural variations. Although somatic mosaicism may have functional and pathological implications, there has been no comprehensive estimate of the number and allelic frequency of genomic variations in normal somatic cells in various tissues of the human body, as it remains difficult to detect somatic mosaic variants given their limited presence in tissue, at times amounting to less than a fraction of a percent. To circumvent that problem, we sequenced the genomes of clonal cell populations derived from single brain progenitor cells to identify genomic variations present in the founder cell and manifested in each clone at 50% allele frequency. Unlike single cell sequencing, our approach avoids amplification artifacts. For data analysis, we developed a workflow to synergize calls from several variant calling programs: MuTect, SomaticSniper, Strelka, and VarScan for SNVs; Scalpel, Strelka, and VarScan for indels; CNVnator for CNVs. By applying the workflow to compare germline genomes of different individuals, we performed a data-driven estimation of workflow sensitivity. Using real data for six clones from an individual healthy brain, we detected per clone 200–500 SNVs at >75% sensitivity, 10–30 indels at >40% sensitivity, and 1–5 CNVs. Orthogonal experimental validation revealed a ~100% specificity of the calls generated. Thus, our analysis has revealed extensive somatic mosaicism within the human brain.
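
The abstract does not say how calls from the different programs are combined. One common way to do this, shown here purely as an assumption, is to keep a variant only if a minimum number of callers report it. The caller names, the variant tuples and the `min_support` threshold below are all hypothetical.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_support=2):
    """Keep variants reported by at least `min_support` callers."""
    support = Counter()
    for calls in calls_by_caller.values():
        support.update(set(calls))  # each caller votes at most once per variant
    return {variant for variant, votes in support.items() if votes >= min_support}


# Hypothetical calls as (chromosome, position, reference, alternate) tuples.
calls_by_caller = {
    "caller_1": [("chr1", 100, "A", "G"), ("chr2", 50, "C", "T")],
    "caller_2": [("chr1", 100, "A", "G")],
    "caller_3": [("chr1", 100, "A", "G"), ("chr3", 7, "G", "A")],
}
kept = consensus_calls(calls_by_caller)
```

Raising `min_support` trades sensitivity for specificity, which is the kind of trade-off the sensitivity estimates in the abstract quantify.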

                                                                              OP19 (PT) - BayesTyping: a Tool for HLA typing with PacBio CCS Reads
                                                                              Date: Sunday, July 10 11:40 am - 12:40 pm
                                                                              Room: America's Seminar
                                                                              Theme: Sequence Analysis

                                                                                Presentation Overview: Show

                                                                                The human leukocyte antigen (HLA) gene family plays a critical role in many areas of biomedicine, including organ transplantation, autoimmune diseases and infectious diseases. Because the gene family contains the most polymorphic genes in humans, clinical applications and biomedical research require highly accurate HLA typing. NGS data have been shown to support high-resolution HLA typing; however, reads from most platforms are not long enough to span the two consecutive exons, i.e., exon 2 and exon 3, which leads to phasing ambiguities. Long reads from the PacBio system, on the other hand, can unequivocally resolve the phasing problem. Because the advantage of PacBio long reads can be compromised by their high error rates, we propose a typing method that adapts Bayes’ theorem so that it tolerates sequencing errors as well as de-multiplexing errors. We have implemented the method and integrated the HLA typing pipeline into a program named BayesTyping.
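
How Bayes' theorem can absorb sequencing errors is illustrated below. This is a generic sketch, not the BayesTyping implementation: it assumes a single allele per sample, a uniform prior, reads already aligned to full-length candidate alleles of equal length, and independent substitution errors at a fixed hypothetical rate. All names and sequences are invented.

```python
import math


def log_likelihood(read, allele, error_rate):
    """Log P(read | allele) with independent per-base substitution errors."""
    match = math.log(1.0 - error_rate)
    mismatch = math.log(error_rate / 3.0)  # an error yields one of three other bases
    return sum(match if r == a else mismatch for r, a in zip(read, allele))


def allele_posteriors(reads, alleles, error_rate=0.05):
    """Posterior over candidate alleles under a uniform prior (Bayes' theorem)."""
    log_post = {
        name: sum(log_likelihood(read, seq, error_rate) for read in reads)
        for name, seq in alleles.items()
    }
    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    weights = {name: math.exp(lp - top) for name, lp in log_post.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


alleles = {"allele_A": "ACGTACGT", "allele_B": "ACGTTCGT"}  # hypothetical sequences
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]  # the second read carries one error
posteriors = allele_posteriors(reads, alleles)
best = max(posteriors, key=posteriors.get)
```

A read with an error lowers the likelihood of the true allele only slightly, so a handful of consistent reads still concentrates the posterior on it.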

                                                                                OP20 (PT) - AODP: An improved method for signature oligonucleotide design
                                                                                Date: Sunday, July 10 11:40 am - 12:40 pm
                                                                                Room: America's Seminar
                                                                                Theme: Comparative Genomics

                                                                                  Presentation Overview: Show

                                                                                  High-throughput Next Generation Sequencing (NGS) technologies and reference databases have enhanced our ability to explore diversity at genetic and taxonomic levels. Most off-the-shelf tools for examining genetic diversity implement algorithms that rely on sequence similarity and composition, which can lead to resolution loss in genetic comparisons, particularly at the species/sub-species taxonomic ranks. We present a new version of the Automated Oligonucleotide Design Pipeline (AODP). AODP designs signature oligonucleotides (SO) with specificity and fidelity based on genome or DNA barcode sequence identity, reducing the resolution loss observed with existing approaches. SO designed with AODP highlight regions with taxon or clade-specific polymorphisms that are useful for comparative genomics and provide suitable candidates for the design of primers/probes in diagnostic assays. AODP has several unique features: 1) The AODP algorithm uses a novel packed-Trie data structure, with support for multi-threaded insertion, optimized for DNA nucleotide strings, which scales well to multi-processor architectures; 2) SO can be designed for a large dataset with relatively small memory footprint; 3) Regions of DNA with a single nucleotide polymorphism (SNP) can be optionally ignored to minimize noise caused by sequencing errors during NGS; 4) The specificity of SO can be further validated against large reference databases; 5) SO thermodynamic properties can be calculated for wet-lab experimental conditions; and 6) SO can be directly used for in silico identification of taxa from environmental NGS data.
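
The core idea of a signature oligonucleotide, a subsequence present in one taxon and absent from all others, can be sketched with fixed-length k-mers. This is a simplified illustration and not AODP itself: it uses a plain dictionary rather than the packed trie described above, ignores thermodynamics and SNP filtering, and the taxon names and sequences are hypothetical.

```python
from collections import defaultdict


def signature_kmers(sequences, k):
    """Map each taxon to the k-mers found in that taxon and in no other."""
    owners = defaultdict(set)  # k-mer -> set of taxa that contain it
    for taxon, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(taxon)
    signatures = defaultdict(set)
    for kmer, taxa in owners.items():
        if len(taxa) == 1:  # unique to a single taxon
            signatures[next(iter(taxa))].add(kmer)
    return dict(signatures)


sequences = {"taxon_1": "ACGTAC", "taxon_2": "ACGTTC"}  # hypothetical barcodes
result = signature_kmers(sequences, k=4)
```

The shared 4-mer ACGT is discarded, while the k-mers overlapping the single differing base are kept as candidates for each taxon.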

                                                                                  OP21 (PT) - Tech Startups of the Genome: De novo genes arise frequently, try to be useful, and occasionally succeed and survive
                                                                                  Date: Sunday, July 10 2:00 pm - 3:00 pm
                                                                                  Room: America's Seminar
                                                                                  Theme: Protein Structure and Function Prediction and Analysis

                                                                                    Presentation Overview: Show

                                                                                    Most new protein-coding genes originate from old genes by duplication and domain shuffling. It was previously assumed that intergenic DNA could not yield long enough protein products through random mutations. Yet de novo protein-coding genes - derived from intergenic DNA - were recently found in multiple species. These genes are of particular interest as they alone can invent novel protein structures.
                                                                                    We asked how often de novo genes appear, how many exist in any genome and what proteins they make. We built a mathematical model incorporating gene dimensions and genome dynamic processes (mutation, recombination, selection). It predicts that de novo genes can easily be created and that at any time many young de novo genes exist, most being lost quickly. We identified thousands of de novo genes by phylostratigraphy in five genomes and analyzed their biophysical properties using structural bioinformatics. We found that, compared to ancient proteins, de novo proteins are shorter, more disordered, promiscuous (interacting with more proteins and DNA), vulnerable to proteases, and less prone to aggregation. Moreover, de novo proteins lack Pfam domains and may be structurally novel.
                                                                                    Frequent gene creation and a reduced tendency towards aggregation (which is toxic) provide a steady-state population of young de novo genes in the genome. This, along with de novo proteins’ propensity to interact, increases the chance that some will use their novel structures (and possibly novel functionalities) to integrate into existing genetic networks and survive for a long evolutionary time.

                                                                                    OP22 (PT) - Substitution rate variability causes memory effects on protein sequence evolution and is triggered by sequence coevolution
                                                                                    Date: Sunday, July 10 2:00 pm - 3:00 pm
                                                                                    Room: America's Seminar
                                                                                    Theme: Sequence Analysis

                                                                                      Presentation Overview: Show

                                                                                      The assumption of lack of memory, i.e. Markovianity, is common to many models of protein sequence evolution, in particular to those based on point accepted mutation matrices (Dayhoff et al., 1978). Nevertheless, it has been observed (Benner et al., 1994; Mitchison and Durbin, 1995) that evolution seems to proceed differently at different time scales, questioning the Markovian assumption. We show that the among-site variability of substitution rates introduces an effective memory that makes protein sequence evolution non-Markovian: each site retains the 'memory' of its own substitution rate, and this influences both the local destiny of that site and the global destiny of the full sequence. We introduce a simple model that describes the occurrence of substitutions in a generic protein sequence, based on the idea that mutations are more likely to be accepted at sites that interact with a spot where a substitution has occurred in the recent past. The model therefore extends the usual assumptions made in protein coevolution by introducing a time damping on the effect of a substitution. We validate this model by successfully predicting the correlation of substitutions as a function of their distance along the sequence. Despite its simplicity, this model predicts a distribution of substitution rates highly compatible with a gamma distribution, consistent with common wisdom (Yang 1993, Yang et al. 1994).
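
Why rate variability alone produces an effective memory can be seen in a standard worked example. It is offered as an illustration and is not the authors' model: it assumes each site is substituted at a constant rate r and that r follows a gamma distribution with shape \alpha and rate \beta across sites (both symbols are introduced here).

```latex
% Probability that a site with fixed rate r is still unsubstituted at time t.
% This is memoryless: the probability factorizes over consecutive intervals.
S_r(t) = e^{-rt}, \qquad S_r(t+s) = S_r(t)\,S_r(s)

% Average over sites with r \sim \mathrm{Gamma}(\alpha, \beta):
S(t) = \int_0^\infty e^{-rt}\,
       \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha-1} e^{-\beta r}\, dr
     = \left(\frac{\beta}{\beta + t}\right)^{\alpha}

% The mixture is a power law, so the factorization no longer holds:
S(t+s) > S(t)\,S(s) \quad \text{for } t, s > 0
```

The inequality says that a site which has already survived for time t is more likely to be a slow site, so its past carries information about its future. That is exactly the kind of memory a Markov model at the sequence level cannot represent.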

                                                                                      OP23 (PT) - The evolutionary origin of orphan genes
                                                                                      Date: Sunday, July 10 2:00 pm - 3:00 pm
                                                                                      Room: America's Seminar
                                                                                      Theme: Comparative Genomics

                                                                                        Presentation Overview: Show

                                                                                        Many of the most powerful tools in biology rely on inference of homologs via sequence-based algorithms. However, many loci are invisible to such methods. Those that are short or rapidly evolving, such as orphan genes and small non-coding RNAs, may yield no significant hits, whereas low-complexity or high-copy-number loci may hide in a crowd of false positives. Searching by context bypasses this problem. We present an algorithm for tracing loci between genomes using a synteny map, and test its efficacy by mapping all Arabidopsis thaliana-specific genes to the genomes of eight related species. By reducing the search space and winnowing false positives, we were able to assess the origin of the individual orphan genes with unprecedented resolution. We traced many to their non-genic cousins, identifying the non-genic footprint from which they arose. We linked others to putative genes in related species from which they diverged beyond recognition. Knowing the approximate location of each gene across species also provides a starting point for future studies. Our pipeline can easily be adapted to contextualize elusive elements such as small RNAs and lineage-specific genes in any species for which reliable synteny maps can be built.
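
Searching by context can be sketched as follows: rather than aligning the query locus itself, the nearest syntenic anchors on either side of it are used to bound where its counterpart should lie in the target genome. This is a simplified illustration and not the authors' algorithm. It assumes a single chromosome, anchors that are collinear and on the same strand, and a hypothetical anchor format.

```python
def search_interval(locus_start, locus_end, anchors):
    """Bound the target-genome region expected to contain a query locus.

    anchors: list of (query_start, query_end, target_start, target_end)
    tuples describing syntenic blocks (assumed collinear, same strand).
    Returns (target_start, target_end) or None if the locus is not flanked.
    """
    upstream = [a for a in anchors if a[1] <= locus_start]
    downstream = [a for a in anchors if a[0] >= locus_end]
    if not upstream or not downstream:
        return None
    left = max(upstream, key=lambda a: a[1])     # closest anchor before the locus
    right = min(downstream, key=lambda a: a[0])  # closest anchor after the locus
    return (left[3], right[2])


# Hypothetical synteny map.
anchors = [
    (100, 200, 1000, 1100),
    (500, 600, 1400, 1500),
    (900, 1000, 1800, 1900),
]
interval = search_interval(300, 400, anchors)
```

Restricting the homology search to the returned interval is what shrinks the search space and removes most false positives from repetitive or low-complexity sequence elsewhere in the genome.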

                                                                                        OP24 (PT) - Tertiary Structural Propensities Reveal Basic Sequence-Structure Relationships in Proteins
                                                                                        Date: Sunday, July 10 2:00 pm - 3:00 pm
                                                                                        Room: America's Seminar
                                                                                        Theme: Protein Structure and Function Prediction and Analysis

                                                                                          Presentation Overview: Show

                                                                                          The Protein Data Bank (PDB) is a key source of the general principles that have shaped our understanding of protein structure. Most of the existing statistical generalizations of protein structures are made for secondary structures, which are often too generic to satisfy many specific design goals, or for protein domains, for which the PDB distribution is highly biased by evolution or human sampling and is thus not physically meaningful. To fill this gap, we propose local tertiary motifs (TERMs) as a new fundamental structural unit. TERMs are combinations of small, non-contiguous secondary-structure fragments connected by inter-residue contacts. We hypothesized that the PDB contains valuable quantitative information at the level of TERMs. We studied the propensities of TERMs within their corresponding ensembles, i.e. geometrically similar structural fragments from completely unrelated proteins. The TERM propensities are physically meaningful in many contexts. By breaking a protein structure into its constituent TERMs, we can evaluate the accuracy of structure-prediction models and identify poorly predicted regions via a metric we named “structure score”, which captures the sequence-structure relationships in TERMs. Also, querying TERMs affected by point mutations enables straightforward prediction of mutational free energies. Our performance exceeds or is comparable to that of state-of-the-art methods. Our results suggest that the data in the PDB are now sufficient to enable the quantification of complex structural features, such as those associated with entire TERMs. This should present opportunities for advances in computational structural biology techniques, including structure prediction and design.

                                                                                          OP25 (PT) - Natural language processing in text mining for protein docking
                                                                                          Date: Sunday, July 10 2:00 pm - 3:00 pm
                                                                                          Room: America's Seminar
                                                                                          Theme: Protein Structure and Function Prediction and Analysis

                                                                                            Presentation Overview: Show

                                                                                            High-throughput sequencing has become rapid and inexpensive, providing a vast amount of protein and DNA sequences for many genomes. The next challenge for biology is to use this information to gain fundamental insights into biomolecular mechanisms. One important direction towards this goal is structural reconstruction of entire interactomes/biological pathways, with subsequent mapping of genetic variants/mutations onto the corresponding structures. Due to inherent limitations of experimental techniques, most structures of protein-protein interactions (PPI) have to be computationally modeled (docked). Protein docking pipelines produce a large number of putative docking models. Identification of near-native models among them is a serious challenge. At the same time, a rapidly growing amount of publicly available information from biomedical research provides constraints on the binding mode, which can be essential for the docking. Recently, we have shown the potential of basic text mining (TM) for protein docking (Badal VD, Kundrotas PJ, Vakser IA, PLoS Comput Biol, 2015, 11: e1004630). Here we present an extension of the TM tool, which utilizes natural language processing (NLP) to analyze residue-containing sentences and their surroundings in the retrieved PubMed abstracts. To generate sentence dependency trees, we used the Stanford parser, and we used inverse distances between PPI-relevant keywords and residues mentioned in the abstracts to discriminate the non-interface residues. We tested WordNet, dictionary look-up and deep parsing NLP approaches. The procedure was benchmarked on 579 X-ray bound structures of binary protein complexes and validated in docking of unbound protein structures from the DOCKGROUND resource (http://dockground.compbio.ku.edu).

                                                                                            OP26 (PT) - Protein Classification using Specific Domain Architectures
                                                                                            Date: Sunday, July 10 2:00 pm - 3:00 pm
                                                                                            Room: America's Seminar
                                                                                            Theme: Protein Structure and Function Prediction and Analysis

                                                                                              Presentation Overview:

                                                                                              Advances in genomic sequencing technology have drastically increased the amount of available sequence data, escalating the need for rapid annotation of genes and protein models. Recently, the Conserved Domain Database curation team has been developing an in-house procedure, the SPecific ARChitecture Labeling Engine (SPARCLE), to study the extent to which protein domain architecture can be used to define groups of proteins with similar molecular function and to derive corresponding functional characterizations. So far, about 3,000 common domain architectures from bacteria have been labelled, and SPARCLE will be made available to the public as a searchable resource. Currently, SPARCLE only considers best-scoring or top-ranked domain hits and is also hampered by imperfect domain annotation. To overcome some of these limitations, we propose an alternative computational procedure for defining clusters of functionally similar proteins that utilizes pre-computed domain annotation from each available source database (COGs, TIGRFAMs, Pfam, and NCBI-curated annotations) for grouping protein sequences, instead of the terse domain annotation currently employed by SPARCLE. This approach provides tunable fine-grained separation of domain architectures, and has been tested on multiple domain architecture families and several genomic datasets. The quality of the resulting classifications has been examined by curators and validated via analysis of the consistency and uniqueness of clusters. We will also discuss the limitations uncovered to date, and hope that this study will identify suitable approaches for rapid, sustainable, and increasingly accurate functional labeling of protein models predicted from genomic sequences.

                                                                                              OP27 (PT) - TRACE: Reconstructing trajectories of cell cycle evolution using single-cell mass cytometry data
                                                                                              Date: Sunday, July 10 2:00 pm - 3:00 pm
                                                                                              Room: America's Seminar
                                                                                              Theme: Systems Biology and Networks

                                                                                                Presentation Overview:

                                                                                                As single-cell experimental approaches become increasingly popular, cell-to-cell heterogeneity has emerged as a key determinant factor contributing to variability in gene expression and signaling responses. Mass cytometry (CyTOF) is a new proteomic technology that enables the simultaneous quantification of dozens of proteins in thousands of individual cells. In the context of cancer research, recent applications of CyTOF include the characterization of inter- and intra-tumor heterogeneity and the identification of novel cell subpopulations. However, as already demonstrated for single-cell RNA-seq, the resulting measurements are largely influenced by confounding factors, such as the cell cycle and cell volume.
                                                                                                We present here TRACE, a novel computational approach to quantify this source of variability. TRACE first exploits a hybrid machine learning approach to classify single cells into discrete cell cycle phases according to measurements of established markers. Next, a metric embedding optimization technique creates a one-dimensional continuous marker that tracks biological pseudotime, and individual cells are subsequently ordered according to this pseudotime marker. The resulting cell cycle trajectories across perturbation time points allow us to separate cell cycle effects from experimentally induced responses, enabling the direct comparison of signaling responses through cell cycle progression. Additionally, we show that volume biases can be corrected using housekeeping gene measurements. Our approach, implemented in a simple and intuitive graphical user interface, was used to analyze data from various cell lines subjected to different stimulations. In each case, TRACE was able to separate confounding effects from signaling responses, enabling the unbiased analysis of biological processes.

                                                                                                OP28 (PT) - Protein-level regulation drives coordinated gene functions
                                                                                                Date: Sunday, July 10 2:00 pm - 3:00 pm
                                                                                                Room: America's Seminar
                                                                                                Theme: Systems Biology and Networks

                                                                                                  Presentation Overview:

                                                                                                  The relationships between gene expression, cellular functions, and disease phenotypes have been defined largely by transcriptome profiling. Transcriptomic studies rely explicitly or implicitly on the assumption that co-expressed mRNAs share similar biological functions, which guides common data analysis approaches, including gene clustering, co-expression network analysis, and gene set enrichment analysis. However, recent studies report only a moderate correlation between mRNA and protein profiles. Quantitative analysis of multi-level gene expression regulation is conceptually and technically challenging, and a key question — whether protein co-expression or mRNA co-expression better predicts gene co-functionality — remains largely unexplored. Here, we address this question in cancer using rich mRNA and protein profiling data from The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC). We constructed mRNA and protein co-expression networks for three cancer types with matched mRNA and protein profiling data sets. The analyses revealed a marked difference between the wiring of the protein and mRNA co-expression networks. Whereas protein co-expression was driven primarily by functional similarity between co-expressed genes, mRNA co-expression was driven by both co-function and chromosomal co-localization of the genes. Protein-level regulation strengthened the link between gene expression and function for at least three quarters of Gene Ontology (GO) biological processes and ninety percent of KEGG pathways. A web application developed based on the three protein networks revealed novel gene-function relationships. Protein-level regulation provides essential mechanisms to drive coordinated gene functions. Elucidating these mechanisms requires proteomic measurements.

                                                                                                  OP29 (PT) - TEtranscripts: A package for including transposable elements in differential expression analysis of RNA-seq datasets
                                                                                                  Date: Sunday, July 10 2:00 pm - 3:00 pm
                                                                                                  Room: America's Seminar
                                                                                                  Theme: Sequence Analysis

                                                                                                    Presentation Overview:

                                                                                                    Motivation: Most RNA-seq data analysis software packages are not designed to handle the complexities involved in properly apportioning short sequencing reads to highly repetitive regions of the genome. These regions are often occupied by transposable elements (TEs), which make up 20-80% of eukaryotic genomes. They can contribute a substantial portion of transcriptomic and genomic sequence reads, but are typically ignored in most analyses.

                                                                                                    Results: Here we present a method and software package for including both gene- and TE-associated ambiguously mapped reads in differential expression analysis. Our method shows improved recovery of TE transcripts over other published expression analysis methods, in both synthetic data and qPCR/NanoString-validated published datasets.

                                                                                                    Availability: The source code, associated GTF files for TE annotation, and testing data are freely available at http://hammelllab.labsites.cshl.edu/software.

                                                                                                    OP30 (PT) - Supervised learning enables detection of duplicates in biological databases
                                                                                                    Date: Sunday, July 10 2:00 pm - 3:00 pm
                                                                                                    Room: America's Seminar
                                                                                                    Theme: Sequence Analysis

                                                                                                      Presentation Overview:

                                                                                                      Motivation: Duplication in biological sequence databases has persisted for 20 years. Duplicate records introduce redundancies to databases, delay biocuration processes, and undermine the accuracy of studies based on sequence analysis such as GC content and melting temperature. Rapid growth of data makes purely manual de-duplication nearly impossible, and existing automatic systems cannot detect duplicates as precisely as experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While a mature approach in other duplicate detection contexts, machine learning has seen only preliminary application in the large biological sequence databases.

                                                                                                      Results: We developed a supervised duplicate detection method, employing an expert-curated dataset of over one million duplicate pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed both binary and multi-class models. Both models achieve promising performance; the binary model has over 90% accuracy in all five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. In particular, better measurement on sequences drives the performance.

                                                                                                      OP31 (PT) - BioStudies – a database of biological studies
                                                                                                      Date: Sunday, July 10 3:30 pm - 4:30 pm
                                                                                                      Room: America's Seminar
                                                                                                      Theme: Open Science and Citizen Science

                                                                                                        Presentation Overview:

                                                                                                        BioStudies is a new database at EBI that aims to address the current limitations of the traditional structured data archives available to scientists.

                                                                                                        It is able to accept and store data from new and emergent technologies where data are produced in formats not supported by the current EBI data resources. BioStudies is also able to link to data in other databases; this is particularly advantageous in multiomic studies where data have been deposited in a number of repositories but with no central link to tie everything together. Due to the flexible nature of its data model, BioStudies is also able to store the supplementary data associated with publications.

                                                                                                        A simple tab-delimited text format, PAGE-TAB, has been developed to enable the capture of all the information described. PAGE-TAB allows the submitter to describe files and external links associated with a study, organise information in hierarchies, and attach annotation as appropriate. Extra functionality can be added for specific purposes, such as a compound view in the ‘Data Infrastructure for Chemical Safety (diXa)’ project.

                                                                                                        Users can submit through a new online tool that supports input of metadata (including the data release date), direct upload of files, links to already deposited data, and associated publication information. The tool enables users to maintain and edit their own BioStudies records.

                                                                                                        As of March 2016, BioStudies contains 578,167 studies that are free to browse, download and reuse. The user interface supports ontology-driven query expansion, enabling powerful searching across thousands of datasets.

                                                                                                        OP32 (PT) - Next generation genomic computing
                                                                                                        Date: Sunday, July 10 3:30 pm - 4:30 pm
                                                                                                        Room: America's Seminar
                                                                                                        Theme: Sequence Analysis

                                                                                                          Presentation Overview:

                                                                                                          Next-generation sequencing (NGS) technologies and data processing pipelines are rapidly and inexpensively providing increasingly numerous sequencing data and associated (epi)genomic features of many individual genomes in multiple biological and clinical conditions, generally made publicly available within well-curated repositories. Answers to fundamental biomedical problems are hidden in these data; yet, their efficient management and integrative processing is becoming the biggest and most important “big data” problem of mankind. Multi-sample processing of heterogeneous information can support data-driven discoveries and biomolecular sense making, such as discovering how heterogeneous genomic, transcriptomic and epigenomic features cooperate to characterize biomolecular functions; yet, it requires state-of-the-art “big data” computing strategies, with abstractions beyond commonly used tool capabilities.

                                                                                                          We recently proposed a new paradigm in NGS data management and processing by introducing an essential Genomic Data Model (GDM) that uses a few general abstractions for genomic region data and associated experimental, biological and clinical metadata, and that guarantees interoperability between existing data formats. Leveraging GDM, we developed a next-generation, high-level, declarative GenoMetric Query Language (GMQL) for genomics data; here, we demonstrate its usefulness, flexibility and simplicity of use through several biological query examples. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous samples; computational efficiency and high scalability are achieved by using parallel computing on clusters or public clouds. GDM and GMQL are applicable to federated repositories, and can be exploited to provide integrated access to curated data, made available by large consortia such as ENCODE, Epigenomics Roadmap, or TCGA, through user-friendly search services.

                                                                                                          OP33 (PT) - Disturbances of transcriptional networks in congenital heart disease
                                                                                                          Date: Sunday, July 10 3:30 pm - 4:30 pm
                                                                                                          Room: America's Seminar
                                                                                                          Theme: Epigenetics

                                                                                                            Presentation Overview:

                                                                                                            The most common form of congenital heart disease (CHD), namely ventricular septal defect (VSD), is a subfeature of Tetralogy of Fallot (TOF), which comprises the majority of cases of cyanotic CHD. The underlying causes of the bulk of CHDs are still unclear but most probably consist of a combination of genetic, epigenetic and environmental factors. DNA methylation is the most widely studied epigenetic modification and, here, we present the first analysis of genome-wide DNA methylation data (MBD-seq) obtained from myocardial biopsies of TOF and VSD patients. We found clear methylation differences between cases and controls, and between patient groups. For TOF, we linked DNA methylation with genome-wide gene expression data (RNA-seq) and found a significant overlap between hypermethylated promoters and down-regulated genes, and vice versa. Interestingly, we found examples of methylation changes co-localized with novel, differential splicing events among sarcomeric genes. In addition to DNA methylation, short non-coding RNAs like microRNAs have been shown to play a role in gene silencing. Thus, we further analyzed genome-wide small RNA-seq data from TOF patients and controls. Subsequently, we combined the microRNA expression data with previously analyzed gene expression profiles. In summary, our data suggest that DNA methylation and microRNAs likely contribute to the pathogenesis of CHD by modulating disease-specific gene expression profiles.

                                                                                                            OP34 (PT) - Hidden RNA Codes Revealed from in vivo RNA Structurome
                                                                                                            Date: Sunday, July 10 3:30 pm - 4:30 pm
                                                                                                            Room: America's Seminar
                                                                                                            Theme: Sequence Analysis

                                                                                                              Presentation Overview:

                                                                                                              RNA can fold into secondary and tertiary structures, which are important for regulation of gene expression. We recently developed a method to perform genome-wide RNA structure profiling in vivo employing high-throughput sequencing techniques, and applied this methodology to Arabidopsis. This method makes it possible to probe thousands of RNA structures at one time in living cells. Hidden RNA codes have been revealed by bioinformatic analyses of our RNA structuromes including RNA structures related to alternative polyadenylation and splicing [1].
                                                                                                              Recently, further analysis of this dataset revealed a correlation between mRNA structure and the encoded protein structure, wherein the regions of individual mRNAs that code for protein domains generally have significantly higher structural reactivity than regions that encode protein domain junctions. This relationship is prominent for proteins annotated for catalytic activity but is reversed in proteins annotated for binding and transcription regulatory activity. We also found that mRNA segments that code for ordered regions have significantly higher structural reactivity than those that encode disordered regions [2].
                                                                                                              We also developed a new computational platform, StructureFold, to facilitate the analysis of high-throughput RNA structure profiling data. As a component of the Galaxy platform (https://usegalaxy.org), StructureFold integrates four computational modules that can be used through a user-friendly web-based interface or via local installation [3].


                                                                                                              [1] Ding Y, Tang Y, Kwok CK, Zhang Y, Bevilacqua PC, Assmann SM. Nature. 2014;505:696-700.
                                                                                                              [2] Tang Y, Assmann SM, Bevilacqua PC. J Mol Biol. 2016;428:758-766.
                                                                                                              [3] Tang Y, Bouvier E, Kwok CK, Ding Y, Nekrutenko A, Bevilacqua PC, Assmann SM. Bioinformatics. 2015;31:2668-75.

                                                                                                              OP35 (PT) - Identification and functional validation of novel long intergenic noncoding RNAs in myogenesis using an integrated genomic approach
                                                                                                              Date: Sunday, July 10 3:30 pm - 4:30 pm
                                                                                                              Room: America's Seminar
                                                                                                              Theme: Sequence Analysis

                                                                                                                Presentation Overview:

                                                                                                                Long intergenic noncoding RNAs (lincRNAs) are a novel class of regulators that play important roles in many biological processes. Myogenesis is the formation of muscular tissue, particularly during embryonic development. Little is known about how lincRNAs are involved in skeletal myogenesis. First, to identify the functional lincRNAs in myogenesis, we present a novel computational framework that can accurately identify potential functional lincRNAs from millions of assembled transcripts obtained from transcriptome sequencing data during myogenesis. Second, among the many potential functional lincRNAs identified, we functionally validate a novel lincRNA, Linc-YY1, transcribed from the promoter of the transcription factor (TF) Yin Yang 1 (YY1) gene. We demonstrate that Linc-YY1 is dynamically regulated during myogenesis in vitro and in vivo. Gain or loss of function of Linc-YY1 in C2C12 myoblasts or muscle satellite cells alters myogenic differentiation, and in injured muscles it affects the course of regeneration. Linc-YY1 interacts with YY1 through its middle domain to evict the YY1/Polycomb repressive complex (PRC2) from target promoters, thus activating gene expression in trans. Altogether, we show that Linc-YY1 regulates skeletal myogenesis and uncover a previously unappreciated mechanism of gene regulation by lincRNA.

                                                                                                                The work described here is substantially supported by General Research Funds (GRF) and the Collaborative Research Fund (CRF) from the Research Grants Council (RGC) of the Hong Kong Special Administrative Region, China (476113, 473713, 14116014, 14113514 and C6015-14G).

                                                                                                                OP36 (PT) - Comparison of normalization methods for single cell RNA-seq
                                                                                                                Date: Sunday, July 10 3:30 pm - 4:30 pm
                                                                                                                Room: America's Seminar
                                                                                                                Theme: Sequence Analysis

                                                                                                                  Presentation Overview:

                                                                                                                  Single cell RNA-seq has been widely used in biological studies. Removing technical noise and normalizing the sequencing data are critical to fully exploit the power of this technology. Various methods have been developed for normalization, including FPKM, UQ, DESeq, RUV, and GRM. Among these, RUV and GRM can use ERCC spike-ins to calibrate the technical noise. There is an urgent need to assess the performance of these methods using data with ground truth.
                                                                                                                  Recently, the NIH Single Cell Analysis Program – Transcriptome Project generated an RNA-seq data set using different amounts of RNA (10pg, 100pg and bulk) with ERCC. These data provide an unprecedented opportunity to compare different methods using the same data set. After normalization using each method, we clustered the samples and assumed that bulk samples are most similar to each other, that 100pg samples are more similar to bulk than 10pg samples are, and that 10pg samples are the most diverse. We evaluated the clustering performance using several statistical indices.
                                                                                                                  The results showed that, without ERCC, UQ, DESeq and RUV perform comparably to each other and better than FPKM. RUV and GRM with ERCC significantly outperformed the methods that do not use ERCC. Between RUV and GRM, GRM is more robust to the choice of gene set.
                                                                                                                  In summary, we present the first systematic comparison of normalization methods for single cell RNA-seq. We found that considering ERCC helps to remove technical noise and drastically improves clustering results. This study provides guidance for selecting normalization methods for analyzing single cell RNA-seq data.
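For readers unfamiliar with the simpler normalizations compared here, the sketch below shows upper-quartile (UQ) scaling alongside a basic spike-in rescaling. It is a toy under stated assumptions: the data, the function names, and the choice to rescale by total spike-in counts are hypothetical, and the spike-in step is a simplified stand-in that is not the RUV or GRM procedure.

```python
import statistics


def upper_quartile_normalize(counts):
    """Divide one sample's counts by the 75th percentile of its nonzero counts."""
    nonzero = [c for c in counts if c > 0]
    q75 = statistics.quantiles(nonzero, n=4)[2]
    return [c / q75 for c in counts]


def spike_in_scale(counts, spike_idx, reference_total):
    """Rescale a sample so that its summed spike-in counts equal reference_total."""
    spike_total = sum(counts[i] for i in spike_idx)
    factor = reference_total / spike_total
    return [c * factor for c in counts]


# Toy data: each row is a sample, each column a gene.
# The last two columns are treated as (hypothetical) spike-in controls.
samples = [
    [10, 0, 40, 20, 100, 50],
    [20, 0, 80, 40, 200, 100],  # same profile sequenced twice as deeply
]
uq = [upper_quartile_normalize(s) for s in samples]
spike = [spike_in_scale(s, spike_idx=[4, 5], reference_total=150) for s in samples]
```

In this toy case the second sample is the first one at twice the depth, so both normalizations map the two samples onto the same profile. The methods would only diverge on data where endogenous RNA content and spike-in signal vary independently, which is the situation the 10pg, 100pg and bulk comparison probes.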

                                                                                                                  OP37 (PT) - Through a glass darkly: Single-cell co-expression as seen through the Gene Ontology
                                                                                                                  Date: Sunday, July 10 3:30 pm - 4:30 pm
                                                                                                                  Room: America's Seminar
                                                                                                                  Theme: Functional Genomics

                                                                                                                    Presentation Overview:

                                                                                                                    Co-expression networks have been a useful tool for functional genomics, providing important clues about the cellular and biochemical mechanisms that are active in normal and disease processes. With the recent advances in single cell RNA-seq technology, it is now possible to zoom in to identify pathways at single cell resolution. We performed the first major analysis of single cell co-expression, sampling from 31 individual studies comprising 28799 samples from 163 cell types. Data from 163 bulk RNA-seq experiments were used as an external control. Using neighbor voting in cross-validation, we found that single cell network connectivity is less likely to overlap with known Gene Ontology (GO) functions than co-expression derived from bulk RNA-seq (aggregate sc AUROC=0.68, aggregate bulk AUROC=0.73), which can be attributed to the preferential occurrence of expression drop-outs in single cell data. Strikingly, we discovered that functional variation within cell types strongly resembles variation occurring across cell types (rs~0.95). The lack of additional variation within cell types suggests that current knowledge in GO cannot readily identify functions occurring in a cell-type-specific manner, and that systematic mining of single cell data may be required to define novel pathways.
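A minimal sketch of neighbor voting with cross-validation is given below, to make the AUROC figures easier to interpret. The fold scheme, the scoring rule (the share of a gene's connection weight that goes to genes still labelled with the function), and the toy network are assumptions for illustration; the abstract does not describe the exact implementation used in the study.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def neighbor_voting_auroc(network, labels, n_folds=3):
    """Mean AUROC over folds; each fold hides the labels of a subset of positive genes.

    network: symmetric matrix (list of lists) of co-expression weights.
    labels: one boolean per gene, True if annotated with the function.
    Assumes at least n_folds positive genes and at least one negative gene.
    """
    n = len(labels)
    positives = [i for i in range(n) if labels[i]]
    fold_aurocs = []
    for fold in range(n_folds):
        hidden = set(positives[fold::n_folds])
        train = [i for i in positives if i not in hidden]
        scores = []
        for g in range(n):
            degree = sum(network[g][j] for j in range(n) if j != g)
            votes = sum(network[g][j] for j in train if j != g)
            scores.append(votes / degree if degree else 0.0)
        # Evaluate on hidden positives versus genes never labelled positive
        test = [i for i in range(n) if i in hidden or not labels[i]]
        fold_aurocs.append(
            auroc([scores[i] for i in test], [i in hidden for i in test])
        )
    return sum(fold_aurocs) / n_folds


# Toy network: genes 0-2 form one tight module, genes 3-5 another
n_genes = 6
network = [
    [0.0 if i == j else (0.9 if (i < 3) == (j < 3) else 0.1) for j in range(n_genes)]
    for i in range(n_genes)
]
labels = [True, True, True, False, False, False]
result = neighbor_voting_auroc(network, labels)
```

An AUROC of 0.5 would mean that network neighbors carry no information about the function, while values near 1 mean that hidden members of a gene set are reliably recovered from their labelled neighbors.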

                                                                                                                    OP38 (PT) - The SNPs associated with protein-drug binding sites
                                                                                                                    Date: Sunday, July 10 3:30 pm - 4:30 pm
                                                                                                                    Room: America's Seminar
                                                                                                                    Theme: Genetic Variation Analysis

                                                                                                                      Presentation Overview: Show

                                                                                                                      The presence of SNPs on ligand-binding sites often has important functional consequences, leading to pathogenicity and variation in drug response. Understanding how SNPs may alter the efficacy and metabolism of certain drugs is crucial for successful implementation of the precision medicine model.
                                                                                                                      We review 136 unique protein-drug complexes and analyze the non-synonymous SNPs present in the drug-binding sites and the proximal residues. About 90% of these proteins have SNPs associated with fewer than 45% of their binding residues. In total, 2664 unique SNPs (2563 missense and 101 stop-gain mutations) are mapped. Frequency or clinical significance data are available for only 25.49% of these SNPs. Most show very low minor allele frequency in the populations and are associated with pathogenicity or drug response. Only two of the SNPs are found to be present in the GWAS catalogue. For the rest of the SNPs, online tools are used to predict the functional effects and conservation. We also analyze the SNP-containing amino acids and the mutations that show significant differences between the binding residues and the rest of the protein sequences. Moreover, the protein-drug complexes with significant differences in the presence of SNPs on binding sites are separately investigated.
                                                                                                                      This study is an effort towards understanding the possible effects of SNPs on drug response. We have comprehensively analyzed the association of SNPs with drug-binding sites and also highlighted the gaps in current knowledge.

                                                                                                                      OP39 (PT) - Detection and genotyping of short indels using sequence data from multiple samples
                                                                                                                      Date: Sunday, July 10 3:30 pm - 4:30 pm
                                                                                                                      Room: America's Seminar
                                                                                                                      Theme: Genetic Variation Analysis

                                                                                                                        Presentation Overview: Show

                                                                                                                        Short insertions and deletions (indels) are the second most common type of variation in the human genome. Despite tremendous advances in high-throughput sequencing technologies and computational methods for variant calling from DNA sequence data, accurate detection of indels remains a challenge. Some of the reasons for this difficulty include over-representation of short indels in regions of low sequence complexity, variability in indel error rates across different platforms as well as the lack of good error models for indels.

                                                                                                                        We have developed an EM algorithm for the detection and genotyping of short indels using aligned sequence reads from multiple individuals. Our probabilistic method models sequence context-specific error rates to estimate the posterior probability of a variant and genotypes. Modeling such error rates is particularly important for indel detection in homopolymer regions.

                                                                                                                        Using extensive simulations, we assessed the power of our EM algorithm to detect indels as a function of read depth, population allele frequency and indel error rates. Our method was significantly more accurate than the recently proposed population-based method SOAP-popIndel. We subsequently performed a comprehensive comparison of our method against a number of leading variant calling methods including GATK HaplotypeCaller, FreeBayes and Platypus, using exome data from the 1000 Genomes Project. Our algorithm is shown to have high sensitivity and a low false positive rate compared to the other methods. We further demonstrate that our population-based approach enables the discovery of indels that would be impossible to call using individual data.
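To make the idea of multi-sample EM genotyping concrete, here is a minimal sketch under simplifying assumptions; it is not the presenters' model. It assumes a single candidate indel site, a binomial read model, Hardy-Weinberg genotype priors, and one error rate supplied by the caller (in the abstract this rate would depend on sequence context, for example homopolymer length). The function names and the starting allele frequency are hypothetical.

```python
import math


def genotype_likelihoods(alt_reads, depth, error_rate):
    """Binomial likelihood of the read counts given 0, 1 or 2 copies of the indel."""
    alt_fraction = (error_rate, 0.5, 1.0 - error_rate)
    coefficient = math.comb(depth, alt_reads)
    return [
        coefficient * p ** alt_reads * (1.0 - p) ** (depth - alt_reads)
        for p in alt_fraction
    ]


def em_indel(samples, error_rate, n_iter=100, tol=1e-10):
    """Estimate the population allele frequency and per-sample genotype posteriors.

    samples: list of (indel_supporting_reads, total_reads) pairs, one per individual.
    error_rate: assumed probability (0 < error_rate < 1) that a read shows the wrong allele.
    """
    likelihoods = [genotype_likelihoods(a, d, error_rate) for a, d in samples]
    freq = 0.1  # arbitrary starting value (assumption)
    posteriors = []
    for _ in range(n_iter):
        # E-step: genotype posteriors under Hardy-Weinberg priors.
        priors = ((1.0 - freq) ** 2, 2.0 * freq * (1.0 - freq), freq ** 2)
        posteriors = []
        for lik in likelihoods:
            joint = [l * p for l, p in zip(lik, priors)]
            total = sum(joint)
            posteriors.append([j / total for j in joint])
        # M-step: allele frequency is the expected number of indel alleles per chromosome.
        new_freq = sum(post[1] + 2.0 * post[2] for post in posteriors) / (2.0 * len(samples))
        converged = abs(new_freq - freq) < tol
        freq = new_freq
        if converged:
            break
    # The posteriors returned are those from the last E-step.
    return freq, posteriors


reads = [(0, 20), (10, 20), (20, 20), (0, 20)]
freq, posteriors = em_indel(reads, error_rate=0.01)
calls = [max(range(3), key=lambda g: post[g]) for post in posteriors]
print(round(freq, 3), calls)
```

Pooling samples lets the allele-frequency estimate inform each individual's prior, which is one way a population-based caller can support indels that look ambiguous in a single sample.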

                                                                                                                        Rost Award (PT) - ISCB Outstanding Contributions Award - Recipient: Burkhard Rost
                                                                                                                        Date: TBA
                                                                                                                        Room: TBA
                                                                                                                        Theme:

                                                                                                                          Presentation Overview: Show

                                                                                                                          SST02- Part B (PT) - Trends and Methods in Genomic Data Compression.
                                                                                                                          Date: Monday, July 11th 10:10 am - 10:30 am
                                                                                                                          Room: BCD
                                                                                                                          Theme:

                                                                                                                            Presentation Overview: Show

                                                                                                                            SST02- Part C (PT) - Meaningful Data Compression and Reduction of High-Throughput Sequencing Data.
                                                                                                                            Date: Monday, July 11th 10:50 am - 11:10 am
                                                                                                                            Room: BCD
                                                                                                                            Theme:

                                                                                                                              Presentation Overview: Show

                                                                                                                              SST02- Part D (PT) - Compressive Structural Bioinformatics: High Efficiency 3D Structure Compression.
                                                                                                                              Date: Monday, July 11th 11:40 am - 12:00 pm
                                                                                                                              Room: BCD
                                                                                                                              Theme:

                                                                                                                                Presentation Overview: Show

                                                                                                                                SST02- Part E (PT) - Theoretical Foundations and Software Infrastructure for Biological Network Databases.
                                                                                                                                Date: Monday, July 11th 12:00 pm - 12:20 pm
                                                                                                                                Room: BCD
                                                                                                                                Theme:

                                                                                                                                  Presentation Overview: Show

                                                                                                                                  SST02- Part F (PT) - Task-Specific Compression for Biomedical Big Data.
                                                                                                                                  Date: Monday, July 11th 12:20 pm - 12:40 pm
                                                                                                                                  Room: BCD
                                                                                                                                  Theme:

                                                                                                                                    Presentation Overview: Show

                                                                                                                                    SST03 A (PT) - Genomic big data management and the GenoMetric Query Language
                                                                                                                                    Date: Tuesday, July 12 10:30 am - 10:50 am
                                                                                                                                    Room: BCD
                                                                                                                                    Theme:

                                                                                                                                      Presentation Overview: Show

                                                                                                                                      SST03 B (PT) - TCGA2BED and CAMUR for cancer NGS data processing
                                                                                                                                      Date: Tuesday, July 12 10:50 am - 11:10 am
                                                                                                                                      Room: BCD
                                                                                                                                      Theme:

                                                                                                                                        Presentation Overview: Show

                                                                                                                                        SST03 C (PT) - Searching patterns in genomic feature regions
                                                                                                                                        Date: Tuesday, July 12 10:50 am - 11:10 am
                                                                                                                                        Room: BCD
                                                                                                                                        Theme:

                                                                                                                                          Presentation Overview: Show

                                                                                                                                          SST03 E (PT) - Semi-automated human genome annotation using chromatin data
                                                                                                                                          Date: Tuesday, July 12 12:00 pm - 12:20 pm
                                                                                                                                          Room: BCD
                                                                                                                                          Theme:

                                                                                                                                            Presentation Overview: Show

                                                                                                                                            SST04 A (PT) - Biological Basis for Modeling Bacterial Communities
                                                                                                                                            Date: Tuesday, July 12 2:20 pm - 2:40 pm
                                                                                                                                            Room: BCD
                                                                                                                                            Theme:

                                                                                                                                              Presentation Overview: Show

                                                                                                                                              SST04 B (PT) - Molecular Tweeting: Bacteria Network Formation, Dynamics, and Control with Healthcare Applications
                                                                                                                                              Date: Tuesday, July 12 2:40 pm - 3:00 pm
                                                                                                                                              Room: BCD
                                                                                                                                              Theme:

                                                                                                                                                Presentation Overview: Show

                                                                                                                                                SST04 C (PT) - Data-Driven Modeling and In Silico Simulation of Cell Signaling Pathways
                                                                                                                                                Date: Tuesday, July 12 3:30 pm - 3:50 pm
                                                                                                                                                Room: BCD
                                                                                                                                                Theme:

                                                                                                                                                  Presentation Overview: Show

                                                                                                                                                  SST04 D (PT) - On Scaling Graph Algorithms for Microbiome Applications
                                                                                                                                                  Date: Tuesday, July 12 3:50 pm - 4:10 pm
                                                                                                                                                  Room: BCD
                                                                                                                                                  Theme:

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    TP001 (PT) - Robust Detection of Alternative Splicing in a Population of Single Cells
                                                                                                                                                    Date: Sunday, July 10 10:10 am - 10:30 am
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Joshua Welch, UNC Chapel Hill, United States
                                                                                                                                                    • Yin Hu, Sage Bionetworks, United States
                                                                                                                                                    • Jan Prins, UNC Chapel Hill, United States

                                                                                                                                                    Area Session Chair: Ioannis Xenarios

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Single cell RNA-seq data promises to be an invaluable tool for characterizing cellular heterogeneity, but study of alternative splicing in single cells has been limited by the unique challenges of single cell data and lack of suitable analysis methods. We present SingleSplice, which is to our knowledge the first algorithm for identifying alternative splicing in a population of single cells. SingleSplice uses a statistical model trained on the technical noise profile of synthetic spike-in transcripts to identify genes exhibiting biological variation in isoform composition. We applied SingleSplice to data from 279 mouse embryonic stem cells and discovered genes that show significant alternative splicing across the set of cells. A subset of these genes are linked to cell cycle stage, suggesting a novel connection between alternative splicing and the cell cycle. Using SingleSplice, we also characterized the isoform usage heterogeneity of 466 adult and fetal human cortical cells.

                                                                                                                                                    TP002 (PT) - DFLpred: High throughput prediction of disordered flexible linker regions in protein sequences
                                                                                                                                                    Date: Sunday, July 10 10:10 am - 10:30 am
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: PROTEINS
                                                                                                                                                    • Fanchi Meng, University of Alberta, Canada
                                                                                                                                                    • Lukasz Kurgan, Virginia Commonwealth University, United States

                                                                                                                                                    Area Session Chair: Lenore Cowen

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Motivation: Disordered flexible linkers (DFLs) are disordered regions that serve as flexible linkers/spacers in multi-domain proteins or between structured constituents in domains. They are different from flexible linkers/residues since they are disordered and longer. Availability of experimentally annotated DFLs provides an opportunity to build high-throughput computational predictors of these regions from protein sequences. To date, there are no computational methods that directly predict DFLs and they can be found only indirectly by filtering predicted flexible residues with predictions of disorder.
                                                                                                                                                    Results: We conceptualized, developed and empirically assessed a first-of-its-kind sequence-based predictor of DFLs, DFLpred. This method outputs the propensity to form DFLs for each residue in the input sequence. DFLpred uses a small set of empirically selected features that quantify propensities to form certain secondary structures, disordered regions and structured regions, which are processed by a fast linear model. Our high-throughput predictor can be used on the whole-proteome scale; it needs less than 1 hour to predict an entire proteome on a single CPU. When assessed on an independent test dataset with low sequence-identity proteins, it secures an area under the ROC curve (AUC) equal to 0.715 and outperforms existing alternatives that include methods for the prediction of flexible linkers, flexible residues, intrinsically disordered residues, and various combinations of these methods. Prediction on the complete human proteome reveals that about 10% of proteins have a large content of over 30% DFL residues. We also estimate that about 6000 DFL regions are long, with 30 or more consecutive residues.
                                                                                                                                                    Availability: http://biomine.ece.ualberta.ca/DFLpred/.

                                                                                                                                                    TP003 (PT) - Functionally Profiling Metagenomes and Metatranscriptomes at Species-Level Resolution
                                                                                                                                                    Date: Sunday, July 10 10:10 am - 10:30 am
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: SYSTEMS / GENES
                                                                                                                                                    • Eric Franzosa, Harvard T. H. Chan School of Public Health, United States
                                                                                                                                                    • Lauren McIver, Harvard T. H. Chan School of Public Health, United States
                                                                                                                                                    • Gholamali Rahnavard, Harvard T. H. Chan School of Public Health, United States
                                                                                                                                                    • George Weingart, Harvard T. H. Chan School of Public Health, United States
                                                                                                                                                    • Karen Schwarzberg, Northern Arizona University, United States
                                                                                                                                                    • Luke Thompson, University of Colorado at Boulder, United States
                                                                                                                                                    • Rob Knight, University of California San Diego, United States
                                                                                                                                                    • J. Gregory Caporaso, Northern Arizona University, United States
                                                                                                                                                    • Nicola Segata, University of Trento, Italy
                                                                                                                                                    • Curtis Huttenhower, Harvard T. H. Chan School of Public Health, United States

                                                                                                                                                    Area Session Chair: Alex Bateman

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Profiling microbial community function typically involves mapping millions of metagenomic or metatranscriptomic (“meta’omic”) sequencing reads against comprehensive reference sequence databases, often by translated search. In addition to being time-consuming and error-prone, this approach only provides an aggregate profile for a community, thus obscuring the contributions of individual species. To address these challenges, we designed a new tiered strategy for meta’omic functional profiling (HUMAnN2). Our method 1) rapidly identifies the species in a meta’omic sample, 2) maps sequencing reads to a sample-specific database constructed from those species’ pangenomes, and 3) only falls back to translated search for unclassified reads. In evaluations using synthetic data, HUMAnN2’s predicted functional profiles were 87% accurate at the community level (vs. 33% for pure translated search), and 79 to 91% accurate at the level of individual species. We applied HUMAnN2 to identify conserved metabolic pathways among 921 metagenomes from the Human Microbiome Project. In this task, HUMAnN2 tended to explain the majority of sample reads 10x faster than traditional search methods, thus saving thousands of CPU hours. Moreover, by highlighting individual species’ functional contributions, HUMAnN2 revealed new ecological patterns of functional conservation in the human microbiome (e.g. conserved metabolic pathways contributed by different species in different individuals). We expect our improvements to the performance and resolution of meta’omic functional profiling to be broadly applicable to analyses of host- and environmentally-associated microbial communities. HUMAnN2 is open-source, fully documented, and available for download now from http://huttenhower.sph.harvard.edu/humann2.
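The three-tier control flow described in this abstract can be sketched with a toy example. This is not HUMAnN2's code or interface: every name below (`tiered_profile`, `species_markers`, `pangenomes`, `translated_search`) is hypothetical. Exact substring matching stands in for real read alignment, and a caller-supplied function stands in for the slow translated search.

```python
from collections import Counter


def tiered_profile(reads, species_markers, pangenomes, translated_search):
    """Toy tiered functional profiling.

    reads: list of nucleotide strings.
    species_markers: dict mapping species name to a marker sequence (tier 1 stand-in).
    pangenomes: dict mapping species name to {gene_sequence: function_name}.
    translated_search: callable taking a read and returning a function name or None.
    Returns (profile, n_fallback), where profile counts reads per (species, function).
    """
    # Tier 1: call a species present if any read contains its marker.
    detected = [
        species
        for species, marker in species_markers.items()
        if any(marker in read for read in reads)
    ]

    # Tier 2: build a sample-specific database from the detected species only.
    sample_db = {
        gene: (species, function)
        for species in detected
        for gene, function in pangenomes.get(species, {}).items()
    }

    profile = Counter()
    n_fallback = 0
    for read in reads:
        hit = next(
            (annotation for gene, annotation in sample_db.items() if read in gene),
            None,
        )
        if hit is None:
            # Tier 3: only unmapped reads reach the expensive fallback search.
            n_fallback += 1
            function = translated_search(read)
            if function is not None:
                profile[("unclassified", function)] += 1
        else:
            profile[hit] += 1
    return profile, n_fallback


markers = {"sp_A": "AAAA", "sp_B": "CCCC"}
genes = {
    "sp_A": {"GGAAAATT": "pathway_1"},
    "sp_B": {"TTCCCCGG": "pathway_2"},
}
sample = ["AAAAT", "GAAAA", "ACGTAC"]
profile, n_fallback = tiered_profile(
    sample, markers, genes, lambda read: "pathway_3" if read == "ACGTAC" else None
)
print(dict(profile), n_fallback)
```

In this toy sample only one of three reads reaches the fallback, which illustrates why restricting the expensive search to unclassified reads can save compute while keeping per-species attribution for the rest.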

                                                                                                                                                    TP004 (PT) - Scalable latent-factor models applied to single-cell RNA-seq data separate biological drivers from confounding effects
                                                                                                                                                    Date: Sunday, July 10 10:30 am - 10:50 am
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Florian Buettner, EMBL-EBI, United Kingdom
                                                                                                                                                    • John C. Marioni, EMBL-EBI, United Kingdom
                                                                                                                                                    • Oliver Stegle, EMBL-EBI, United Kingdom

                                                                                                                                                    Area Session Chair: Ioannis Xenarios

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Single-cell RNA-sequencing (scRNA-seq) allows heterogeneity in gene expression levels to be studied in large populations of cells. However, such heterogeneity can arise due to both technical and biological factors, which makes decomposing the sources of variation extremely difficult. Current methods to dissect this heterogeneity have critical limitations: they do not scale to large datasets comprising tens of thousands of cells and, in particular, do not permit joint modelling of the effects of biological factors and additional unknown and confounding sources of variation. We here describe a computationally efficient model that uses latent factors to jointly infer both biological and confounding sources of gene expression variation. We validate the method using simulations, demonstrating both its accuracy and its ability to scale to large datasets with up to 100,000 cells. Moreover, through application of the model to the largest single-cell RNA-seq study generated to date, consisting of 49,300 retina cells, we show that our model can robustly decompose scRNA-seq datasets into interpretable components and facilitate the identification of novel sub-populations.

                                                                                                                                                    TP005 (PT) - Unexpected Features of the Dark Proteome
                                                                                                                                                    Date: Sunday, July 10 10:30 am - 10:50 am
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: PROTEINS
                                                                                                                                                    • Nelson Perdigão, Universidade de Lisboa, Portugal
                                                                                                                                                    • Julian Heinrich, CSIRO, Australia
                                                                                                                                                    • Christian Stolte, CSIRO, Australia
                                                                                                                                                    • Kenneth Sabir, Garvan Institute of Medical Research, Australia
                                                                                                                                                    • Michael Buckley, CSIRO, Australia
                                                                                                                                                    • Bruce Tabor, CSIRO, Australia
                                                                                                                                                    • Beth Signal, Garvan Institute of Medical Research, Australia
                                                                                                                                                    • Brian Gloss, Garvan Institute of Medical Research, Australia
                                                                                                                                                    • Christopher Hammang, Garvan Institute of Medical Research, Australia
                                                                                                                                                    • Burkhard Rost, Technische Universität München, Germany
                                                                                                                                                    • Andrea Schafferhans, Technische Universität München, Germany
                                                                                                                                                    • Sean O'Donoghue, CSIRO & Garvan Institute, Australia

                                                                                                                                                    Area Session Chair: Lenore Cowen

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    We surveyed the "dark" proteome - that is, regions of proteins never observed by experimental structure determination and inaccessible to homology modeling. For 546,000 Swiss-Prot proteins, we found that 44-54% of the proteome in eukaryotes and viruses was dark, compared with only 14% in archaea and bacteria. Surprisingly, most of the dark proteome could not be accounted for by conventional explanations, such as intrinsic disorder or transmembrane regions. Nearly half of the dark proteome comprised dark proteins, in which the entire sequence lacked similarity to any known structure. Dark proteins fulfill a wide variety of functions, but a subset showed distinct and largely unexpected features, such as association with secretion, specific tissues, the endoplasmic reticulum, disulfide bonding, and proteolytic cleavage. Dark proteins also had short sequence length, low evolutionary reuse, and few known interactions with other proteins. These results suggest new research directions in structural and computational biology.

                                                                                                                                                    TP006 (PT) - Integrating very large multi'omics data by hierarchical all-against-all association testing
                                                                                                                                                    Date: Sunday, July 10 10:30 am - 10:50 am
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: SYSTEMS
                                                                                                                                                    • Gholamali Rahnavard, The Broad Institute, Harvard T.H. Chan School of Public Health, United States
                                                                                                                                                    • Eric A. Franzosa, The Broad Institute, Harvard T.H. Chan School of Public Health, United States
                                                                                                                                                    • Lauren J. McIver, Harvard T.H. Chan School of Public Health, United States
                                                                                                                                                    • George Weingart, Harvard T.H. Chan School of Public Health, United States
                                                                                                                                                    • Emma Schwager, The Broad Institute, Harvard T.H. Chan School of Public Health, United States
                                                                                                                                                    • Yo Sup Moon, Harvard T.H. Chan School of Public Health, United States
                                                                                                                                                    • Xochitl C. Morgan, Harvard T.H. Chan School of Public Health, United States
                                                                                                                                                    • Levi Waldron, City University of New York School of Public Health, Hunter College, United States
                                                                                                                                                    • Curtis Huttenhower, The Broad Institute, Harvard T.H. Chan School of Public Health, United States

                                                                                                                                                    Area Session Chair: Alex Bateman

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Modern multi’omic screens of biological samples readily produce enormous numbers of measurements, yet finding statistically significant association patterns among features within these data remains challenging, in part due to the loss of statistical power inherent with testing large numbers of hypotheses. Here, we present and validate a novel hierarchical framework, HAllA (Hierarchical All-against-All association testing), for general purpose and well-powered association discovery in high-dimensional heterogeneous datasets. HAllA combines hierarchical nonparametric hypothesis testing with false discovery rate correction to enable high-sensitivity discovery of linear and non-linear associations in high-dimensional datasets (which may be categorical, continuous, or mixed). HAllA operates by 1) discretizing data to a unified representation, 2) hierarchically clustering paired high-dimensional datasets, 3) applying dimensionality reduction to boost power and potentially improve signal-to-noise ratio, and 4) iteratively testing associations between blocks of progressively more related features. We validated and optimized HAllA using synthetic datasets of known correlation structure. At a fixed false discovery rate, HAllA is consistently better-powered than naive all-against-all association testing across a range of association types. As an example application, we used HAllA to identify associations between high-throughput profiles of microbial genera and metabolites of the human gut microbiome. In addition to recapitulating known associations, we identified 60 previously unobserved associations, including between Ruminococcus and Lithocholic acid. 
Our implementation of HAllA is highly modular, enabling addition or substitution of alternative methods at each step, and is available with documention at http://huttenhower.sph.harvard.edu/halla.
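
The abstract above contrasts well-powered testing under false discovery rate (FDR) control with the power lost when many hypotheses are tested at once. As a rough illustration only, the sketch below applies the Benjamini-Hochberg step-up procedure, one common FDR method, to a list of p-values and compares it with the stricter Bonferroni correction. This is not HAllA's implementation: the abstract does not say which FDR procedure HAllA uses, and the function names `benjamini_hochberg` and `bonferroni` and the example p-values are assumptions made for this sketch.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected
    under the Benjamini-Hochberg step-up procedure at FDR level alpha.
    (Illustrative helper; not part of HAllA.)"""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


def bonferroni(p_values, alpha=0.05):
    """Return rejections under the Bonferroni correction (family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values from ten pairwise association tests.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print("BH rejections:        ", sum(benjamini_hochberg(p)))
    print("Bonferroni rejections:", sum(bonferroni(p)))
```

In a naive all-against-all setting the number of tests grows with the product of the two feature counts, so the per-test threshold shrinks quickly. Testing blocks of related features first, as the abstract describes, is one way to reduce the number of hypotheses.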

                                                                                                                                                    TP007 (PT) - Lightweight transcriptomics
                                                                                                                                                    Date: Sunday, July 10 10:50 am - 11:10 am
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Surojit Biswas, Harvard University, United States
                                                                                                                                                    • Konstantin Kerner, Sainsbury Laboratory Cambridge University, Germany
                                                                                                                                                    • Sandra Cortijo, Sainsbury Laboratory Cambridge University, United Kingdom
                                                                                                                                                    • Varodom Charoensawan, Sainsbury Laboratory Cambridge University, United Kingdom
                                                                                                                                                    • Vladimir Jojic, UNC-Chapel Hill, United States
                                                                                                                                                    • Philip Wigge, Sainsbury Laboratory Cambridge University, United Kingdom

                                                                                                                                                    Area Session Chair: Ioannis Xenarios

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Transcript levels are a critical determinant of the proteome and hence of cellular function. Because the transcriptome is an outcome of the interactions between genes and their products, we reasoned it may be accurately represented by a subset of transcript abundances. By analyzing thousands of publicly available RNA-Seq datasets, we show that the transcriptomes of A. thaliana and M. musculus are highly compressible. Capitalizing on this observation, we develop a method, Tradict, to reconstruct the expression of globally representative biological processes or the entire transcriptome from the abundances of a small, machine-learned subset of 100 transcripts. These findings suggest natural improvements to both the time and cost of performing forward genetic and small molecule drug screens, mapping eQTLs in natural populations, identifying tumor subtypes, and accurately profiling individual single-cell transcriptomes at scale.

                                                                                                                                                    TP008 (PT) - Widespread Expansion of Protein Interaction Capabilities by Alternative Splicing
                                                                                                                                                    Date: Sunday, July 10 10:50 am - 11:10 am
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: PROTEINS
                                                                                                                                                    • Xinping Yang, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Jasmin Coulombe-Huntington, McGill University, Canada
                                                                                                                                                    • Shuli Kang, University of California, San Diego, United States
                                                                                                                                                    • Gloria M. Sheynkman, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Tong Hao, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Aaron Richardson, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Song Sun, University of Toronto, Canada
                                                                                                                                                    • Fan Yang, University of Toronto, Canada
                                                                                                                                                    • Yun A. Shen, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Ryan R. Murray, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Kerstin Spirohn, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Bridget E. Begg, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Miquel Duran-Frigola, Institute for Research in Biomedicine (IRB Barcelona), Spain
                                                                                                                                                    • Andrew MacWilliams, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Samuel J. Pevzner, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Quan Zhong, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Shelly A. Trigg, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Stanley Tam, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Lila Ghamsari, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Nidhi Sahni, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Song Yi, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Maria D. Rodriguez, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Dawit Balcha, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Guihong Tan, University of Toronto, Canada
                                                                                                                                                    • Michael Costanzo, University of Toronto, Canada
                                                                                                                                                    • Brenda Andrews, University of Toronto, Canada
                                                                                                                                                    • Charles Boone, University of Toronto, Canada
                                                                                                                                                    • Xianghong J. Zhou, University of Southern California, United States
                                                                                                                                                    • Kourosh Salehi-Ashtiani, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Benoit Charloteaux, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Alyce A. Chen, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Michael A. Calderwood, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Patrick Aloy, Institute for Research in Biomedicine (IRB Barcelona), Spain
                                                                                                                                                    • Frederick P. Roth, University of Toronto, Canada
                                                                                                                                                    • David E. Hill, Dana-Farber Cancer Institute, United States
                                                                                                                                                    • Lilia M. Iakoucheva, University of California, San Diego, United States
                                                                                                                                                    • Yu Xia, McGill University, Canada
                                                                                                                                                    • Marc Vidal, Dana-Farber Cancer Institute, United States

                                                                                                                                                    Area Session Chair: Lenore Cowen

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    While alternative splicing is known to diversify the functional characteristics of some genes, the extent to which protein isoforms globally contribute to functional complexity on a proteomic scale remains unknown. To address this systematically, we cloned full-length open reading frames of alternatively spliced transcripts for a large number of human genes, and combined protein-protein interaction profiling with computer modeling to functionally compare hundreds of protein isoform pairs. The majority of isoform pairs share less than 50% of their interactions. In the global context of interactome network maps, alternative isoforms tend to behave like distinct proteins rather than minor variants of each other. Interaction partners specific to alternative isoforms tend to be expressed in a highly tissue-specific manner and belong to distinct functional modules. Our integrated experimental and computational strategy reveals a widespread expansion of protein interaction capabilities through alternative splicing and suggests that many alternative isoforms are functionally divergent.

                                                                                                                                                    TP009 (PT) - Single molecule-level characterization of bacterial epigenomes, heterogeneity and gene regulation
                                                                                                                                                    Date: Sunday, July 10 10:50 am - 11:10 am
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • John Beaulaurier, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Xue-Song Zhang, New York University Medical School, United States
                                                                                                                                                    • Shijia Zhu, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Robert Sebra, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Chaggai Rosenbluh, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Gintaras Deikus, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Nan Shen, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Diana Munera, Harvard Medical School, United States
                                                                                                                                                    • Matthew Waldor, Harvard Medical School, United States
                                                                                                                                                    • Andrew Chess, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Martin Blaser, New York University Medical School, United States
                                                                                                                                                    • Eric Schadt, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Gang Fang, Icahn School of Medicine at Mount Sinai, United States

                                                                                                                                                    Area Session Chair: Alex Bateman

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Beyond its role in host defense, bacterial DNA methylation also plays important roles in the regulation of gene expression, virulence and antibiotic resistance. Bacterial cells in a clonal population can generate epigenetic heterogeneity to increase population-level phenotypic plasticity. Single molecule, real-time (SMRT) sequencing enables the detection of N6-methyladenine and N4-methylcytosine, two major types of DNA modifications comprising the bacterial methylome. However, existing SMRT sequencing-based methods for studying bacterial methylomes rely on a population-level consensus that lacks the single-cell resolution required to observe epigenetic heterogeneity. Here, we present SMALR (single-molecule modification analysis of long reads), a novel framework for single molecule-level detection and phasing of DNA methylation. Using seven bacterial strains, we show that SMALR yields significantly improved resolution and reveals distinct types of epigenetic heterogeneity. SMALR is a powerful new tool that enables de novo detection of epigenetic heterogeneity and empowers investigation of its functions in bacterial populations.

                                                                                                                                                    TP010 (PT) - Analysis of aggregated cell-cell statistical distances within pathways unveils therapeutic-resistance mechanisms in circulating tumor cells
                                                                                                                                                    Date: Sunday, July 10 11:40 am - 12:00 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: DISEASE / SYSTEMS
                                                                                                                                                    • Alfred Schissler, Lussier Lab, United States
                                                                                                                                                    • Qike Li, The University of Arizona, United States
                                                                                                                                                    • James Chen, The Ohio State University, United States
                                                                                                                                                    • Colleen Kenost, The University of Arizona, United States
                                                                                                                                                    • Ikbel Achour, The University of Arizona, United States
                                                                                                                                                    • D. Dean Billheimer, The University of Arizona, United States
                                                                                                                                                    • Haiquan Li, University of Arizona, United States
                                                                                                                                                    • Walter W. Piegorsch, University of Arizona Center for Biomedical Informatics and Biostatistics, United States
                                                                                                                                                    • Yves Lussier, University of Arizona, United States

                                                                                                                                                    Area Session Chair: Ioannis Xenarios

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: As ‘omics’ biotechnologies accelerate the capability to contrast a myriad of molecular measurements from a single cell, they also exacerbate current analytical limitations for detecting meaningful single-cell dysregulations. Moreover, mRNA expression alone lacks functional interpretation, limiting opportunities for translation of single-cell transcriptomic insights to precision medicine. Lastly, most single-cell RNA-sequencing analytic approaches are not designed to investigate small populations of cells such as circulating tumor cells shed from solid tumors and isolated from patient blood samples.
                                                                                                                                                    Results: In response to these characteristics and limitations in current single-cell RNA-sequencing methodology, we introduce an analytic framework that models transcriptome dynamics through the analysis of aggregated cell-cell statistical distances within biomolecular pathways. Cell-cell statistical distances are calculated from pathway mRNA fold changes between two cells. Within an elaborate case study of circulating tumor cells derived from prostate cancer patients, we develop analytic methods of aggregated distances to identify five differentially expressed pathways associated with therapeutic resistance. Our aggregation analyses perform comparably to Gene Set Enrichment Analysis (GSEA) and better than differential gene expression analysis followed by gene set enrichment. However, these methods were not designed to report differential pathway expression for a single cell. As such, our framework culminates in a novel aggregation method, cell-centric statistics (CCS). CCS quantifies the effect size and significance of differentially expressed pathways for a single cell of interest. Improved rose plots of differentially expressed pathways in each cell highlight the utility of CCS for therapeutic decision-making.
                                                                                                                                                    Availability: http://www.lussierlab.org/publications/CCS/

                                                                                                                                                    TP011 (PT) - Large-scale Text Mining Web Services for Bioinformatics Research
                                                                                                                                                    Date: Sunday, July 10 11:40 am - 12:00 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: DATA
                                                                                                                                                    • Chih-Hsuan Wei, NCBI, United States
                                                                                                                                                    • Robert Leaman, NCBI, United States
                                                                                                                                                    • Zhiyong Lu, NCBI, United States

                                                                                                                                                    Area Session Chair: Lenore Cowen

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Processing the biomedical literature with automated tools becomes more important as its growth accelerates. We present NCBI text-mining web services, an online version of our text mining suite for biomedical concept recognition and information extraction. Our service incorporates five state-of-the-art tools we developed previously: DNorm (for diseases), GNormPlus (genes/proteins), SR4GN (species), tmChem (chemicals and drugs), and tmVar (variants). Using our service, users can instantly retrieve results from all five tools for any abstract in PubMed. Users may also process arbitrary text – such as full-text articles or non-PubMed publications – using our asynchronous batch mode, or easily visualize results through our web-based application PubTator. We simplify interoperability by supporting multiple data formats, and handle large requests through a computer cluster to ensure scalability. Our web service is already in wide use, supporting research projects in biocuration, crowdsourcing and translational bioinformatics. The web service is freely available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#curl
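
As a rough illustration of consuming annotation results like those described above, here is a minimal sketch that parses text in the PubTator plain-text layout (title and abstract lines of the form `PMID|t|...` and `PMID|a|...`, followed by tab-separated annotation lines). The layout is assumed from the publicly documented PubTator format, not from this abstract. The function name `parse_pubtator` and the sample record are hypothetical, and no call to the web service is made.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    pmid: str
    start: int          # character offset into title + " " + abstract
    end: int
    mention: str
    concept_type: str   # e.g. Gene, Disease, Chemical
    concept_id: str     # normalized identifier, may be empty


def parse_pubtator(text):
    """Split PubTator-style text into passages and annotations.

    Assumed layout (not taken from the abstract):
      PMID|t|title            PMID|a|abstract
      PMID<TAB>start<TAB>end<TAB>mention<TAB>type<TAB>id
    """
    passages = {}
    annotations = []
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] in ("t", "a"):
            passages[parts[1]] = parts[2]
            continue
        fields = line.split("\t")
        if len(fields) >= 5:
            pmid, start, end, mention, ctype = fields[:5]
            cid = fields[5] if len(fields) > 5 else ""
            annotations.append(
                Annotation(pmid, int(start), int(end), mention, ctype, cid)
            )
    return passages, annotations


# Made-up sample record, for illustration only.
sample = (
    "1|t|BRCA1 mutations in breast cancer\n"
    "1|a|We review known variants.\n"
    "1\t0\t5\tBRCA1\tGene\t672\n"
    "1\t19\t32\tbreast cancer\tDisease\tMESH:D001943\n"
)
passages, annotations = parse_pubtator(sample)
```

Offsets index into the concatenated title and abstract, so slicing the title with an annotation's `start` and `end` recovers the mention for annotations that fall inside the title.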

                                                                                                                                                    TP012 (PT) - Genetic Architectures of Quantitative Variation in RNA Editing Pathways
                                                                                                                                                    Date: Sunday, July 10 11:40 am - 12:00 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Tongjun Gu, University of Florida, United States
                                                                                                                                                    • Daniel Gatti, The Jackson Laboratory, United States
                                                                                                                                                    • Anuj Srivastava, The Jackson Laboratory, United States
                                                                                                                                                    • Elizabeth Snyder, The Jackson Laboratory, United States
                                                                                                                                                    • Narayanan Raghupathy, The Jackson Laboratory, United States
                                                                                                                                                    • Petr Simecek, The Jackson Laboratory, United States
                                                                                                                                                    • Karen Svenson, The Jackson Laboratory, United States
                                                                                                                                                    • Ivan Dotu, The Jackson Laboratory, United States
                                                                                                                                                    • Jeffrey Chuang, The Jackson Laboratory, United States
                                                                                                                                                    • Mark Keller, University of Wisconsin, United States
                                                                                                                                                    • Alan Attie, University of Wisconsin, United States
                                                                                                                                                    • Robert Braun, The Jackson Laboratory, United States
                                                                                                                                                    • Gary Churchill, The Jackson Laboratory, United States

                                                                                                                                                    Area Session Chair: Alex Bateman

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    RNA editing refers to post-transcriptional processes that alter the base sequence of RNA. Recently, hundreds of new RNA editing targets have been reported. However, the mechanisms that determine the specificity and degree of editing are not well understood. We examined quantitative variation of site-specific editing in a genetically diverse multiparent population, Diversity Outbred mice, and mapped polymorphic loci that alter editing ratios globally for C-to-U editing and at specific sites for A-to-I editing. An allelic series in the C-to-U editing enzyme Apobec1 influences the editing efficiency of Apob and 58 additional C-to-U editing targets. We identified 49 A-to-I editing sites with polymorphisms in the edited transcript that alter editing efficiency. In contrast to the shared genetic control of C-to-U editing, most of the variable A-to-I editing sites were determined by local nucleotide polymorphisms in proximity to the editing site in the RNA secondary structure. Our results indicate that RNA editing is a quantitative trait subject to genetic variation and that evolutionary constraints have given rise to distinct genetic architectures in the two canonical types of RNA editing.

                                                                                                                                                    TP013 (PT) - Development of a Bayesian Tensor Factorization Model to Predict Drug Response Curves in Cancer Cell Lines
                                                                                                                                                    Date: Sunday, July 10 12:00 pm - 12:20 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: DISEASE / DATA
                                                                                                                                                    • Nathan Lazar, Oregon Health & Science University, United States
                                                                                                                                                    • Mehmet Gonen, Koç University, Turkey
                                                                                                                                                    • Shannon McWeeney, Oregon Health & Science University, United States
                                                                                                                                                    • Adam Margolin, Oregon Health & Science University, United States
                                                                                                                                                    • Kemal Sonmez, Oregon Health & Science University, United States

                                                                                                                                                    Area Session Chair: Ioannis Xenarios

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Biological data are inherently multi-dimensional, yet most computational methods used today are based to some extent on flattening these data into two-dimensional matrices. We present a new model, BaTFLED (Bayesian Tensor Factorization Linked to External Data), that predicts values in a three-dimensional response tensor using input features for each of the dimensions. We apply this to predict full dose response curves in a panel of 599 cancer cell lines treated with 545 compounds as part of the Cancer Target Discovery and Development (CTD2) effort. BaTFLED learns projection matrices mapping features for cell lines and drugs into latent representations that combine to form the responses. Predictions for new cell lines, drugs or combinations of the two can be made by multiplying through these projection matrices. A Bayesian framework allows us to place distributions on the unknown variables, which encourage sparsity both row-wise in the projection matrices (for feature selection) and in the core tensor which combines the latent vectors (selecting interactions between latent representations). We train the model using a highly efficient variational method that learns optimal parameters for a distribution approximating the true posterior. This talk will explore implications of model design choices, demonstrate initial results on the CTD2 data and discuss how these methods may be applied to other multi-dimensional datasets.
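
To make the prediction step concrete, the following is a minimal sketch of a Tucker-style forward pass: feature vectors are projected into latent vectors, which are then combined through a core tensor. It is not the BaTFLED implementation and omits the Bayesian priors and variational training entirely. The function names, the treatment of the third mode as dose, and all numeric values are assumptions made for illustration.

```python
def project(features, matrix):
    """Map a feature vector (length P) to a latent vector (length R)
    by multiplying with a P x R projection matrix."""
    n_latent = len(matrix[0])
    return [
        sum(f * row[r] for f, row in zip(features, matrix))
        for r in range(n_latent)
    ]


def predict_response(x_cell, x_drug, x_dose, A, B, C, core):
    """Predict one entry of the response tensor.

    A, B, C are projection matrices for the three modes (the third mode
    is assumed here to be dose); core[i][j][k] weights the interaction
    between latent components i, j and k.
    """
    a = project(x_cell, A)
    b = project(x_drug, B)
    c = project(x_dose, C)
    return sum(
        core[i][j][k] * a[i] * b[j] * c[k]
        for i in range(len(a))
        for j in range(len(b))
        for k in range(len(c))
    )


# Toy values chosen by hand, not learned from data.
A = [[1.0, 0.0], [0.0, 1.0]]          # 2 cell-line features -> 2 latents
B = [[1.0, 0.5]]                      # 1 drug feature -> 2 latents
C = [[2.0]]                           # 1 dose feature -> 1 latent
core = [[[1.0], [0.0]], [[0.0], [1.0]]]

y = predict_response([1.0, 2.0], [3.0], [1.0], A, B, C, core)
```

Because prediction only needs feature vectors, the same function applies to a cell line or drug that was never seen in training, which is the point made in the abstract about multiplying through the projection matrices.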

                                                                                                                                                    TP014 (PT) - Text as Data: Using text-based features for protein representation and for computational prediction of their characteristics
                                                                                                                                                    Date: Sunday, July 10 12:00 pm - 12:20 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: DATA / PROTEINS
                                                                                                                                                    • Hagit Shatkay, University of Delaware, United States
                                                                                                                                                    • Scott Brady, University of Toronto, Canada
                                                                                                                                                    • Andrew Wong, Mount Sinai Hospital, Canada

                                                                                                                                                    Area Session Chair: Lenore Cowen

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    The current era of large-scale biology is characterized by a fast-paced growth in the number of sequenced genomes and, consequently, by a multitude of identified proteins whose function has yet to be determined.
                                                                                                                                                    Simultaneously, any known or postulated information concerning genes and proteins is part of the ever-growing published scientific literature, which is expanding at a rate of over a million new publications per year.
                                                                                                                                                    Computational tools that attempt to automatically predict and annotate protein characteristics, such as function and localization patterns, are being developed along with systems that aim to support the process via text mining.
                                                                                                                                                    Most work on protein characterization focuses on features derived directly from protein sequence data. Protein-related work that does aim to utilize the literature typically concentrates on extracting specific facts (e.g., protein interactions) from text.
                                                                                                                                                    In the past few years we have taken a different route, treating the literature as a source of text-based features, which can be employed just as sequence-based protein features were used in earlier work, for predicting protein subcellular location and possibly also function. We discuss here in detail the overall approach, along with results from work we have done in this area demonstrating the value of this method and its potential use.
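
One simple way to picture text-based features is a bag-of-words vector per protein, built from the abstracts that mention it. The sketch below is only an illustration of that general idea and is not the authors' feature construction. The function names, the tokenizer, and the example sentences are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def text_features(docs_by_protein):
    """Build one term-count vector per protein.

    docs_by_protein maps a protein identifier to a list of text snippets
    (for example, abstracts associated with that protein). Returns the
    sorted vocabulary and a dict of count vectors aligned to it.
    """
    counts = {
        protein: Counter(tok for doc in docs for tok in tokenize(doc))
        for protein, docs in docs_by_protein.items()
    }
    vocab = sorted(set().union(*counts.values()))
    vectors = {
        protein: [c[term] for term in vocab] for protein, c in counts.items()
    }
    return vocab, vectors


# Made-up snippets, for illustration only.
docs = {
    "P1": ["Localizes to the nucleus.", "Nucleus import signal."],
    "P2": ["Secreted to the membrane."],
}
vocab, vectors = text_features(docs)
```

The resulting vectors can be passed to any standard classifier in the same way a sequence-derived feature vector would be.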

                                                                                                                                                    TP015 (PT) - A novel algorithm for calling mRNA m6A peaks by modeling biological variances in MeRIP-seq data
                                                                                                                                                    Date: Sunday, July 10 12:00 pm - 12:20 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Xiaodong Cui, UTSA, United States
                                                                                                                                                    • Jia Meng, Xi'an Jiaotong-Liverpool University, China
                                                                                                                                                    • Shaowu Zhang, Northwestern Polytechnical University, China
                                                                                                                                                    • Yidong Chen, UTHSCSA, United States
                                                                                                                                                    • Yufei Huang, UTSA, United States

                                                                                                                                                    Area Session Chair: Alex Bateman

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: N6-methyl-adenosine (m6A) is the most prevalent mRNA methylation, and precise prediction of its location on mRNA is important for understanding its function. A recent sequencing technology, Methylated RNA Immunoprecipitation Sequencing (MeRIP-seq), has been developed for transcriptome-wide profiling of m6A. We previously developed a peak calling algorithm called exomePeak. However, exomePeak over-simplifies data characteristics and ignores the variance of reads among replicates and the dependency of reads across a site region. To further improve performance, a new model is needed to address these important issues of MeRIP-seq data.
                                                                                                                                                    Results: We propose a novel, graphical model-based peak calling method, MeTPeak, for transcriptome-wide detection of m6A sites from MeRIP-seq data. MeTPeak explicitly models the read count of an m6A site and introduces a hierarchical layer of Beta variables to capture the variances and a Hidden Markov model (HMM) to characterize the read dependency across a site. In addition, we developed a constrained Newton’s method and designed a log-barrier function to compute analytically intractable, positively constrained Beta parameters. We applied our algorithm to simulated and real biological datasets and demonstrated significant improvement in detection performance and robustness over exomePeak. Prediction results on publicly available MeRIP-seq datasets are also shown to recapitulate the known patterns of m6A, further validating the improved performance of MeTPeak.
                                                                                                                                                    Availability: The package ‘MeTPeak’ is implemented in R and C++, and additional details are available at https://github.com/compgenomics/MeTPeak
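
The Results above mention a constrained Newton's method with a log-barrier function for positively constrained parameters. The sketch below shows that general technique on a one-dimensional toy objective. It is not the MeTPeak update (see the package for that), and the function name, step rule, and toy objectives are assumptions chosen for illustration.

```python
def barrier_newton(grad, hess, x0=1.0, mu=1.0, shrink=0.1, rounds=8, tol=1e-12):
    """Minimize a convex f(x) subject to x > 0.

    Each round applies Newton's method to f(x) - mu * log(x); the barrier
    term keeps iterates strictly positive. mu is reduced after each round
    so the solution approaches the constrained optimum of f.
    grad and hess return f'(x) and f''(x); f''(x) is assumed positive.
    """
    x = x0
    for _ in range(rounds):
        for _ in range(100):
            g = grad(x) - mu / x
            h = hess(x) + mu / (x * x)
            step = -g / h
            # Halve the step until the new point stays feasible (x > 0).
            t = 1.0
            while x + t * step <= 0.0:
                t *= 0.5
            x += t * step
            if abs(t * step) < tol:
                break
        mu *= shrink
    return x


# Toy objective f(x) = (x + 1)^2: its unconstrained minimum is at x = -1,
# so the positivity constraint is active and the result sits just above 0.
x_boundary = barrier_newton(lambda x: 2.0 * (x + 1.0), lambda x: 2.0)

# Toy objective f(x) = (x - 2)^2: the minimum is interior, near x = 2.
x_interior = barrier_newton(lambda x: 2.0 * (x - 2.0), lambda x: 2.0)
```

The feasibility-only step halving is enough for these toy cases; a production solver would normally add a sufficient-decrease line search as well.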

                                                                                                                                                    TP016 (PT) - DrugE-Rank: Improving Drug-Target Interaction Prediction of New Candidate Drugs or Targets by Ensemble Learning to Rank
                                                                                                                                                    Date: Sunday, July 10 12:20 pm - 12:40 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: DISEASE / DATA
                                                                                                                                                    • Qing-Jun Yuan, Fudan University, China
                                                                                                                                                    • Junning Gao, FDU, China
                                                                                                                                                    • Dongliang Wu, Fudan University, China
                                                                                                                                                    • Shihua Zhang, University of Southern California, United States
                                                                                                                                                    • Hiroshi Mamitsuka, Kyoto University, Japan
                                                                                                                                                    • Shanfeng Zhu, Fudan University, China

                                                                                                                                                    Area Session Chair: Ioannis Xenarios

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: Identifying drug-target interactions is an important task in drug discovery. To reduce the heavy time and financial costs of experimentally identifying drug-target interactions, many computational approaches have been proposed. Although these approaches have used many different principles, their performance is far from satisfactory, especially in predicting drug-target interactions of new drugs or new targets.

                                                                                                                                                    Methods: Machine learning approaches to this problem can be divided into two types: feature-based and similarity-based methods. Learning to rank (LTR) is the most powerful known technique among the feature-based methods, while similarity-based methods are well accepted because they connect the chemical and genomic spaces, represented by drug and target similarities, respectively. We propose a new method, DrugE-Rank, that improves prediction performance by combining the advantages of the two types of methods. Specifically, DrugE-Rank uses LTR, with multiple well-known similarity-based methods used as components of ensemble learning.

                                                                                                                                                    Results: The performance of DrugE-Rank was thoroughly examined in three main experiments using data from DrugBank: 1) cross-validation on FDA (US Food and Drug Administration) approved drugs before March 2014, 2) an independent test on FDA approved drugs after March 2014, and 3) an independent test on FDA experimental drugs. Experimental results show that DrugE-Rank significantly outperformed competing methods, achieving more than 30% improvement in AUPR (area under the precision-recall curve) for new FDA approved drugs and FDA experimental drugs.
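
For readers unfamiliar with AUPR, the metric reported above, here is a minimal sketch of one common estimator, average precision, computed over a ranked list of candidate drug-target pairs. The abstract does not say which estimator the authors used, so treat this as an assumption. The function name and the toy labels and scores are hypothetical.

```python
def average_precision(labels, scores):
    """Estimate the area under the precision-recall curve.

    labels: 1 for a true interaction, 0 otherwise.
    scores: predicted scores, higher meaning more likely to interact.
    Averages the precision observed at the rank of each true interaction.
    Ties in score keep their input order (Python's sort is stable).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy ranking: true interactions at ranks 1 and 3.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In this toy ranking the precision is 1/1 at the first hit and 2/3 at the second, so the average is about 0.83.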

                                                                                                                                                    TP017 (PT) - Good news: we are getting better at predicting protein function
                                                                                                                                                    Date: Sunday, July 10 12:20 pm - 12:40 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: PROTEINS / DATA
                                                                                                                                                    • Predrag Radivojac, Indiana University, United States
                                                                                                                                                    • Yuxiang Jiang, Indiana University, United States
                                                                                                                                                    • Sean Mooney, University of Washington, United States
                                                                                                                                                    • Tal Ronen-Oron, The Buck Institute for Aging Research, United States
                                                                                                                                                    • Casey Greene, University of Pennsylvania, United States
                                                                                                                                                    • Iddo Friedberg, Iowa State University, United States

                                                                                                                                                    Area Session Chair: Lenore Cowen

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Background: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. At the same time, the limitations in technology for generating data and the inherently stochastic nature of biomolecular events have led to a discrepancy between the volume of data and the amount of knowledge gleaned from it. A major bottleneck in our ability to understand the molecular underpinnings of life is the assignment of function to biological macromolecules, especially proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, accurately assessing methods for protein function prediction and tracking progress in the field remain challenging.

                                                                                                                                                    Methodology: We have conducted the second Critical Assessment of Functional Annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. One hundred twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3,681 proteins from 18 species. CAFA2 featured significantly expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis also compared the best methods participating in CAFA1 to those of CAFA2.

                                                                                                                                                    Conclusions: The top performing methods in CAFA2 outperformed the best methods from CAFA1, demonstrating that computational function prediction is improving. This increased accuracy can be attributed to the combined effect of the growing number of experimental annotations and improved methods for function prediction.

                                                                                                                                                    TP018 (PT) - RNAiFold2T: Constraint Programming design of thermo-IRES switches
                                                                                                                                                    Date: Sunday, July 10 12:20 pm - 12:40 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Juan Antonio Garcia-Martin, Department of Biology, Boston College, United States
                                                                                                                                                    • Ivan Dotu, Research Programme on Biomedical Informatics (GRIB), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Spain
                                                                                                                                                    • Javier Fernandez-Chamorro, Centro de Biologia Molecular Severo Ochoa, Consejo Superior de Investigaciones Cientificas – Universidad Autonoma de Madrid, Spain
                                                                                                                                                    • Gloria Lozano, Centro de Biologia Molecular Severo Ochoa, Consejo Superior de Investigaciones Cientificas – Universidad Autonoma de Madrid, Spain
                                                                                                                                                    • Jorge Ramajo, Centro de Biologia Molecular Severo Ochoa, Consejo Superior de Investigaciones Cientificas – Universidad Autonoma de Madrid, Spain
                                                                                                                                                    • Encarnacion Martinez-Salas, Centro de Biologia Molecular Severo Ochoa, Consejo Superior de Investigaciones Cientificas – Universidad Autonoma de Madrid, Spain
                                                                                                                                                    • Peter Clote, Department of Biology, Boston College, United States

                                                                                                                                                    Area Session Chair: Alex Bateman

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: RNA thermometers (RNATs) are cis-regulatory elements that change secondary structure
                                                                                                                                                    upon temperature shift. Often involved in the regulation of heat shock, cold shock and virulence genes,
                                                                                                                                                    RNATs constitute an interesting potential resource in synthetic biology, where engineered RNATs could
                                                                                                                                                    prove to be useful tools in biosensors and conditional gene regulation.
                                                                                                                                                    Results: Solving the 2-temperature inverse folding problem is critical for RNAT engineering. Here
                                                                                                                                                    we introduce RNAiFold2T, the first Constraint Programming (CP) and Large Neighborhood Search
                                                                                                                                                    (LNS) algorithms to solve this problem. Benchmarking tests of RNAiFold2T against existing inverse folding
                                                                                                                                                    programs (adaptive walk and genetic algorithm) show that our software generates two orders of
                                                                                                                                                    magnitude more solutions, thus allowing ample exploration of the space of solutions. Subsequently,
                                                                                                                                                    solutions can be prioritized by computing various measures, including probability of target structure in the
                                                                                                                                                    ensemble, melting temperature, etc. Using this strategy, we rationally designed two thermosensor internal
                                                                                                                                                    ribosome entry site (thermo-IRES) elements, whose normalized cap-independent translation efficiency is
                                                                                                                                                    approximately 50% greater at 42°C than at 30°C, when tested in reticulocyte lysates. Translation efficiency
                                                                                                                                                    is lower than that of the wild-type IRES element, which on the other hand is fully resistant to temperature
                                                                                                                                                    shift-up. This appears to be the first purely computational design of functional RNA thermoswitches, and
                                                                                                                                                    certainly the first purely computational design of functional thermo-IRES elements.
                                                                                                                                                    Availability: RNAiFold2T is publicly available as part of the new release RNAiFold3.0
                                                                                                                                                    at https://github.com/clotelab/RNAiFold and http://bioinformatics.bc.edu/clotelab/RNAiFold,
                                                                                                                                                    the latter of which also provides a web server. The software is written in C++ and
                                                                                                                                                    uses the OR-Tools CP search engine.
                                                                                                                                                    Contact: clote@bc.edu
                                                                                                                                                    Supplementary information: Supplementary data are available at Bioinformatics online.

                                                                                                                                                    TP019 (PT) - Temporal dynamics of collaborative networks in large scientific consortia
                                                                                                                                                    Date: Sunday, July 10 2:00 pm - 2:20 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: SYSTEMS / DATA
                                                                                                                                                    • Daifeng Wang, Yale University, United States
                                                                                                                                                    • Koon-Kiu Yan, Yale University, United States
                                                                                                                                                    • Joel Rozowsky, Yale University, United States
                                                                                                                                                    • Eric Pan, Yale University, United States
                                                                                                                                                    • Mark Gerstein, Yale University, United States

                                                                                                                                                    Area Session Chair: Hagit Shatkay

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    The emergence of collective creative enterprises such as large scientific consortia is a unique feature of modern scientific research, especially in the biomedical field. Recent examples include the ENCyclopedia Of DNA Elements (ENCODE) consortium annotating the human genome and the 1000 Genomes consortium generating a catalog of uniformly called variants for the biomedical community. To ensure that the scientific community can benefit from these efforts, it is important to understand the connections between consortium members and researchers outside of the consortium. To address this issue, we analyzed the temporal co-authorship network structures of the ENCODE and modENCODE consortia [1]. Our analysis of their publication patterns showed that consortium members work closely as a community, whereas non-members collaborate on the scale of a few laboratories. We also identified a few brokers who play an important role in facilitating collaborations with outside researchers, which suggests that large scientific consortia should set up a formal outreach group to communicate with outside researchers.

                                                                                                                                                    [1] Daifeng Wang, Koon-Kiu Yan, Joel Rozowsky, Eric Pan, Mark Gerstein, "Temporal dynamics of collaborative networks driven by large scientific consortia," in press, Trends in Genetics, 2016, doi: 10.1016/j.tig.2016.02.006

                                                                                                                                                    TP020 (PT) - Integrative computational modeling across tumors reveals context specific impact of mutations
                                                                                                                                                    Date: Sunday, July 10 2:00 pm - 2:20 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: DISEASE / GENES
                                                                                                                                                    • Hatice Osmanbeyoglu, Memorial Sloan Kettering Cancer Center, United States
                                                                                                                                                    • Eneda Toska, Memorial Sloan Kettering Cancer Center, United States
                                                                                                                                                    • Jose Baselga, Memorial Sloan Kettering Cancer Center, United States
                                                                                                                                                    • Christina Leslie, Memorial Sloan Kettering Cancer Center, United States

                                                                                                                                                    Area Session Chair: Paul Horton

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Pan-cancer analyses of somatic mutations and copy number aberrations have confirmed that the same genes or pathways are often altered across multiple tumor types. There is great interest in deploying targeted therapies in a pan-cancer manner, matching pathway-targeted drugs to the mutational profile of the tumor regardless of cancer type. However, ‘actionable mutations’ interact with distinct cancer-specific gene regulatory programs and signaling networks and can occur against different genetic backgrounds across tumor types. To better model the context-dependent role of somatic alterations, we applied a novel computational strategy for integrating parallel phosphoproteomic and mRNA sequencing data across 12 The Cancer Genome Atlas (TCGA) tumor data sets, linking dysregulation of upstream signaling pathways with altered transcriptional response. We then developed a statistical approach to interpret the impact of mutations and copy number events in terms of functional outcomes such as altered signaling and transcription factor (TF) activity. Our analysis revealed both known and novel transcriptional regulators downstream of oncogenic pathways. These results have implications for the prospective experimental investigation of targeted therapies in tumors harboring specific mutations. Our evolving understanding of the context-dependent role of somatic alterations may potentially enhance current approaches for combinatorial clinical trial design.

                                                                                                                                                    TP021 (PT) - Boosting alignment accuracy through adaptive local realignment
                                                                                                                                                    Date: Sunday, July 10 2:00 pm - 2:20 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: PROTEINS
                                                                                                                                                    • Dan Deblasio, University of Arizona, United States
                                                                                                                                                    • John Kececioglu, University of Arizona, United States

                                                                                                                                                    Area Session Chair: Jianlin Cheng

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Mutation rates can vary across the residues of a protein, but when multiple sequence alignments are computed for protein sequences, the same choice of values for the substitution score and gap penalty parameters is often used across their entire length. We provide for the first time a new method called adaptive local realignment that automatically uses diverse alignment parameter settings in different regions of the input sequences when computing protein multiple sequence alignments. This allows parameter settings to locally adapt across a protein to more closely match varying mutation rates.

                                                                                                                                                    Our method builds on our prior work on global alignment parameter advising with the Facet alignment accuracy estimator. Given a computed alignment, in each region that has low estimated accuracy, a collection of candidate realignments is generated using a precomputed set of alternate parameter choices. If one of these alternate realignments has higher estimated accuracy than the original subalignment, the region is replaced with the realignment, and the concatenation of these realigned regions forms the new output alignment.

                                                                                                                                                    Adaptive local realignment significantly improves the quality of alignments over using the single best default parameter choice. In particular, this new method of local advising, when combined with prior methods for global advising, boosts alignment accuracy by almost 23% over the best default parameter setting on the hardest-to-align benchmarks (and almost 5.9% over using global advising alone).

                                                                                                                                                    A new version of the Opal multiple sequence aligner that incorporates adaptive local realignment, using Facet for parameter advising, is available free for non-commercial use at facet.cs.arizona.edu.
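
The realignment loop described in this abstract can be sketched as follows. This is a minimal illustration, not the Opal or Facet implementation: the fixed-width column windows, the `toy_estimate` scoring function (a stand-in for an accuracy estimator such as Facet), and the `toy_realign` function (a stand-in for rerunning an aligner with alternate parameters) are all assumptions made for this example.

```python
from typing import Callable, List, Sequence

Block = List[str]  # alignment rows over one column range; '-' marks a gap


def split_columns(aln: Block, width: int) -> List[Block]:
    """Cut an alignment into consecutive column windows of at most `width`."""
    n_cols = len(aln[0])
    return [[row[i:i + width] for row in aln] for i in range(0, n_cols, width)]


def adaptive_local_realign(
    aln: Block,
    width: int,
    threshold: float,
    params: Sequence[str],
    realign: Callable[[Block, str], Block],
    estimate: Callable[[Block], float],
) -> Block:
    """Realign low-scoring windows and concatenate the best window choices."""
    chosen: List[Block] = []
    for block in split_columns(aln, width):
        best, best_score = block, estimate(block)
        if best_score < threshold:  # only low-accuracy regions are revisited
            for p in params:
                candidate = realign(block, p)
                score = estimate(candidate)
                if score > best_score:  # keep a candidate only if it scores higher
                    best, best_score = candidate, score
        chosen.append(best)
    # Rows of the chosen blocks are joined left to right to form the output.
    return ["".join(parts) for parts in zip(*chosen)]


def toy_estimate(block: Block) -> float:
    """Hypothetical estimator: fraction of gap-free columns with one residue type."""
    cols = list(zip(*block))
    if not cols:
        return 0.0
    good = sum(1 for c in cols if "-" not in c and len(set(c)) == 1)
    return good / len(cols)


def toy_realign(block: Block, pad_side: str) -> Block:
    """Hypothetical realigner: drop gaps, then pad rows on one side."""
    stripped = [row.replace("-", "") for row in block]
    width = max(len(s) for s in stripped)
    if pad_side == "right":
        return [s.ljust(width, "-") for s in stripped]
    return [s.rjust(width, "-") for s in stripped]


aln = ["ACGTAC-GT", "ACGTACG-T"]
result = adaptive_local_realign(aln, 4, 0.8, ["left", "right"], toy_realign, toy_estimate)
print(result)  # ['ACGTACGT', 'ACGTACGT']
```

Passing the estimator and realigner as callables keeps the loop independent of any particular aligner; only the middle window scores below the threshold here, so it is the only one replaced.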

                                                                                                                                                    TP022 (PT) - Positive and negative forms of replicability in gene network analysis
                                                                                                                                                    Date: Sunday, July 10 2:20 pm - 2:40 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: SYSTEMS / DATA
                                                                                                                                                    • Wim Verleyen, Cold Spring Harbor Laboratory, United States
                                                                                                                                                    • Sara Ballouz, Cold Spring Harbor Laboratory, United States
                                                                                                                                                    • Jesse Gillis, Cold Spring Harbor Laboratory, United States

                                                                                                                                                    Area Session Chair: Hagit Shatkay

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Presentation description
                                                                                                                                                    In this work, we build a model of scientific communities in which simulated researchers characterize gene function through an individual analysis of particular network data. We model each researcher by sampling from a pool of machine learning algorithms, each of which then samples individually from various public resources. By simulating groups of researchers operating under different constraints, we are able to assess practices leading to successful group decisions. Our analysis reveals an important principle limiting the value of replication, namely that targeting it directly causes ‘easy’ or uninformative replication to dominate analyses. We provide examples of this problem in action and walk through seminal results which replicate precisely because they are unlikely to be true. We also show that this bias has a strong impact in protein-protein interaction data, leading to negative correlations between replicability and good quality control. We discuss some implications for public discourse, particularly on scientific matters.

                                                                                                                                                    Scientific Justification
                                                                                                                                                    Our recent work analyzes what is usually considered a fundamental basis of science – replication – and shows that not only can it be useless as a general heuristic for discovering the truth, it can be damaging when applied naively. Intuitively, the idea is close to that of overfitting in machine learning. Two researchers, both of whom overfit to some data, might obtain more replicable results, but this form of replicability is of little value. Using real data and analysis techniques, we show this problem is apparent in the field of gene network analysis as a whole.

                                                                                                                                                    While we focus on the field-wide meta-analysis, the detailed examples in the paper are particularly important:

                                                                                                                                                    A) We show that a seminal result in autism genetics replicates because it is false. Our detailed walk-through makes results that are otherwise very surprising into intuitive principles.

                                                                                                                                                    B) We show that the negative relationship our model predicts between replicability and quality control can be seen directly in even reports for individual protein-protein interactions.

                                                                                                                                                    Our research in this area is ongoing, and our talk will discuss additional examples, drawn principally from medically important cases (e.g., point (A)), which I think will be of high interest at ISMB, as well as methods for identifying these problems.

                                                                                                                                                    Although the focus is on networks, the model and examples are of relevance to any knowledge-base (hence our area choice). This is work that repays careful consideration and I’m confident that discussing it at ISMB will provide exceptional value to our colleagues.

                                                                                                                                                    TP023 (PT) - COSMOS: accurate detection of somatic structural variations through asymmetric comparison between tumor and normal samples
                                                                                                                                                    Date: Sunday, July 10 2:20 pm - 2:40 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: DISEASE / GENES
                                                                                                                                                    • Koichi Yamagata, AIST, Japan
                                                                                                                                                    • Ayako Yamanishi, Graduate School of Medicine, Osaka University, Japan
                                                                                                                                                    • Chikara Kokubu, Graduate School of Medicine, Osaka University, Japan
                                                                                                                                                    • Junji Takeda, Graduate School of Medicine, Osaka University, Japan
                                                                                                                                                    • Jun Sese, AIST, Japan

                                                                                                                                                    Area Session Chair: Paul Horton

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    An important challenge in cancer genomics is precise detection of structural variations (SVs) by high-throughput short-read sequencing, which is hampered by the high false discovery rates of existing analysis tools. Here we propose an accurate SV detection method named COSMOS, which compares the statistics of the mapped read pairs in tumor samples with isogenic normal control samples in a distinct asymmetric manner. COSMOS also prioritizes the candidate SVs using strand-specific read-depth information. Performance tests on modeled tumor genomes revealed that COSMOS outperformed existing methods in terms of F-measure. We also applied COSMOS to an experimental mouse cell-based model, in which SVs were induced by genome engineering and gamma-ray irradiation, followed by polymerase chain reaction-based confirmation. The precision of COSMOS was 84.5%, while that of the next best existing method was 70.4%. Moreover, the sensitivity of COSMOS was the highest, indicating that COSMOS has great potential for cancer genome analysis.
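
For readers less familiar with the metrics named in this abstract, the sketch below shows how precision, sensitivity (recall), and the F-measure are conventionally computed from counts of true positive, false positive, and false negative calls. The counts in the usage line are hypothetical and are not taken from the COSMOS evaluation.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported calls that are real."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of real events that were reported (sensitivity)."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    p, r = precision(tp, fp), recall(tp, fn)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical counts: 3 confirmed calls, 1 false call, 3 missed events.
print(precision(3, 1), recall(3, 3), round(f_measure(3, 1, 3), 3))  # 0.75 0.5 0.6
```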

                                                                                                                                                    TP024 (PT) - The Post-Genomic Era of Biological Network Alignment: Latest Insights
                                                                                                                                                    Date: Sunday, July 10 2:20 pm - 2:40 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: SYSTEMS
                                                                                                                                                    • Lei Meng, University of Notre Dame, United States
                                                                                                                                                    • Vipin Vijayan, University of Notre Dame, United States
                                                                                                                                                    • Tijana Milenkovic, University of Notre Dame, United States

                                                                                                                                                    Area Session Chair: Jianlin Cheng

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Analogous to genomic sequence alignment, biological network alignment (NA) aims to find regions of similarity between molecular networks of different species. NA can be divided into local (LNA) and global (GNA). LNA finds small, highly conserved network regions; GNA finds large, suboptimally conserved regions. When a new NA method is proposed, it is compared against existing methods from the same NA category. However, both LNA and GNA aim to allow for transferring functional knowledge from well-studied to poorly studied species between conserved (aligned) network regions. So, which one should be chosen, LNA or GNA? To answer this, we introduce the first systematic evaluation of the two NA categories and new measures of alignment quality that allow for fair comparison of the different LNA and GNA outputs. We find that LNA and GNA give complementary results: LNA has high functional but low topological quality, while GNA has the opposite. Thus, we propose IGLOO, a new approach that integrates GNA and LNA. IGLOO allows for a better trade-off between topological and functional alignment quality than any existing LNA or GNA method. NA can also be divided into pairwise NA of two networks (PNA) vs. multiple NA of more than two networks (MNA). MNA may be more useful since it can capture at once biological knowledge common to multiple species. We present multiMAGNA++, a novel and superior MNA approach, and we introduce new MNA quality measures to allow for more complete alignment characterization and fairer MNA method evaluation compared to the existing measures.

                                                                                                                                                    TP025 (PT) - Efficient Data-Driven Model Learning for Dynamical Systems
                                                                                                                                                    Date: Sunday, July 10 2:40 pm - 3:00 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: SYSTEMS / DATA
                                                                                                                                                    • Ermao Cai, Carnegie Mellon University, United States
                                                                                                                                                    • Ifigeneia Apostolopoulou, Carnegie Mellon University, United States
                                                                                                                                                    • Pranay Ranjan, Carnegie Mellon University, United States
                                                                                                                                                    • Paul Pan, Carnegie Mellon University, United States
                                                                                                                                                    • Mark Wuebbens, Carnegie Mellon University, United States
                                                                                                                                                    • Diana Marculescu, Carnegie Mellon University, United States

                                                                                                                                                    Area Session Chair: Hagit Shatkay

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    In the analysis of non-linear dynamical biological systems, it is often of interest to determine an efficient, qualitative estimate of the behavior of the state variables as opposed to exact, quantitative measures which may be intractable or too expensive to obtain. Moreover, established closed form mathematical rules governing system behavior are not always available and one may need to emulate the nature of the system on the basis of observations and experimental data only. In this paper, we propose to rely on Boolean models for analyzing dynamical systems and develop a polynomial time complexity heuristic algorithm to infer such Boolean functions for dynamical systems with refractory periods. Our algorithm is structured to perform even more efficiently for systems with a nested canalizing behavior with respect to certain features, which is indeed the case for life science applications. For data obtained from existing dynamical systems, e.g., T helper (Th) cell signaling network, T-LGL survival network, and T-cell differentiation, our algorithm is 100X faster than two other state-of-the-art methods, yet achieves similar or better accuracy.
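
As a rough illustration of the nested canalizing structure mentioned above (this is not the authors' inference algorithm), a nested canalizing Boolean function can be evaluated by checking its inputs in a fixed order and returning as soon as one of them takes its canalizing value. The function names, the rule encoding, and the three-variable network below are assumptions made for this sketch only.

```python
from typing import List, Sequence, Tuple

# One layer: (input index, canalizing value, output returned when it matches).
Layer = Tuple[int, int, int]


def nested_canalizing(state: Sequence[int], layers: List[Layer], default: int) -> int:
    """Evaluate a nested canalizing Boolean function on a 0/1 state vector."""
    for index, canalizing_value, output in layers:
        if state[index] == canalizing_value:
            return output  # this input alone decides the result
    return default  # no input took its canalizing value


def step(state: Sequence[int], rules: List[Tuple[List[Layer], int]]) -> Tuple[int, ...]:
    """Synchronously update every variable of a Boolean network."""
    return tuple(nested_canalizing(state, layers, default) for layers, default in rules)


# Hypothetical 3-variable network, chosen only for illustration.
RULES = [
    ([(1, 1, 0), (2, 0, 1)], 0),  # x0' = 0 if x1 == 1, else 1 if x2 == 0, else 0
    ([(0, 1, 1)], 0),             # x1' = x0
    ([(1, 0, 1)], 0),             # x2' = NOT x1
]
```

Because evaluation stops at the first matching layer, each function costs at most one pass over its inputs, which is one intuition for why this class of functions is convenient to work with.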

                                                                                                                                                    TP026 (PT) - intSKAT, an integrated Sequence Kernel Association Test, to identify novel clinically impactful somatic mutations in melanomas
                                                                                                                                                    Date: Sunday, July 10 2:40 pm - 3:00 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: DISEASE / DATA
                                                                                                                                                    • Yian Chen, Moffitt Cancer Center, United States
                                                                                                                                                    • Zachary Thompson, Moffitt Cancer Center, United States
                                                                                                                                                    • Jamie Teer, Moffitt Cancer Center, United States
                                                                                                                                                    • Fernanda Flores, Moffitt Cancer Center, United States
                                                                                                                                                    • Manali Phadke, Moffitt Cancer Center, United States
                                                                                                                                                    • Zhihua Chen, Moffitt Cancer Center, United States
                                                                                                                                                    • Eric Welsh, Moffitt Cancer Center, United States
                                                                                                                                                    • Michael Schell, Moffitt Cancer Center, United States
                                                                                                                                                    • Keiran Smalley, Moffitt Cancer Center, United States

                                                                                                                                                    Area Session Chair: Paul Horton

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    INTRODUCTION
                                                                                                                                                    In recent years, much has been learned about the molecular basis of cancer progression, and therapeutic strategies based on mutation information have been developed for some cancer types. In melanoma, for example, ~50% of tumors carry BRAF mutations, and BRAF inhibitors have been developed with initial success for treatment. However, after accounting for patients with the major known driver mutations, BRAF (~50%), NRAS (~15%-20%), and NF1 (~14%), ~20% of melanoma patients still have no clearly known driver mutation responsible for the development or aggressiveness of the disease. The lack of identified important non-passenger mutations in this subgroup (or in other cancer types) poses a significant challenge and also provides a great opportunity for developing therapeutic strategies, particularly personalized ones.
                                                                                                                                                    Traditionally, driver mutations are identified from their frequencies: some methods flag mutations that occur more often than expected, and others determine positive selection for non-silent mutations (such as frameshift indels, nonsense and splice-site mutations) by weighting the predicted functional impact and observed frequencies. Although these methods have been shown to be useful for identifying driver mutations, they have limited power to detect infrequently mutated driver genes.
                                                                                                                                                    We proposed an expansive and integrated approach to link genotype to phenotypes to identify clinically relevant somatic mutations. This is accomplished by performing a flexible and powerful gene-based association test, intSKAT, to investigate the association between mutations in each gene and patients’ overall survival outcome.
                                                                                                                                                    METHODS
                                                                                                                                                    Building upon a gene-based sequence kernel association test (SKAT) [1] developed for germline studies, we developed an integrated association test, intSKAT, to identify novel somatic mutations that are associated with a clinically relevant outcome, e.g., overall survival (OS). We first coded the multi-allelic mutations into bi-allelic variants with reference versus alternative allele. Our method includes an expansive suite of eight gene-based methods: (1) Burden test; (2) SKAT; (3) SKAT-O; (4-6) Burden, SKAT, and SKAT-O weighted by PolyPhen-2 score; (7) Cox regression with mutation status in a gene (0/1) as the predictor; and (8) Cox regression with the number of mutations in a gene as the predictor.
                                                                                                                                                    This method can not only evaluate joint effects of mutations within a gene and identify important genes with infrequent mutations, but also has the flexibility to leverage functional predictions when available. It also allows genes with combinations of different directions of mutations (protective or deleterious) and different levels of functional prediction (unknown or predicted) to be ranked high. FDR adjustment is performed for multiple comparisons within each method, and a minFDR of 10^-3 across all methods is used to declare statistical significance. Furthermore, we performed robust regression of the number of mutations within each gene against the length of its longest transcript. Genes that were significantly associated with OS and also had a higher-than-expected standardized residual were considered more likely to be non-passenger genes.
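
The within-method adjustment and the across-method minimum described above can be sketched as follows. The abstract does not say which FDR procedure is used, so the Benjamini-Hochberg adjustment here is an assumption, as are the function names, the method labels, and the gene identifiers.

```python
from typing import Dict, List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def min_fdr(pvalues_by_method: Dict[str, List[float]], genes: List[str]) -> Dict[str, float]:
    """Adjust within each method, then take the per-gene minimum across methods."""
    per_method = [benjamini_hochberg(p) for p in pvalues_by_method.values()]
    return {gene: min(adj[k] for adj in per_method) for k, gene in enumerate(genes)}


def significant_genes(min_fdr_by_gene: Dict[str, float], threshold: float = 1e-3) -> List[str]:
    """Genes whose minFDR falls below the threshold (10^-3 in the text)."""
    return sorted(g for g, q in min_fdr_by_gene.items() if q < threshold)
```

Each list of p-values is assumed to be ordered the same way as `genes`, for example `min_fdr({"burden": [1e-5, 0.5], "skat": [0.2, 0.3]}, ["GENE_A", "GENE_B"])` with made-up values.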
                                                                                                                                                    Using targeted exome sequencing data from 185 melanoma patients in the Total Cancer Care (TCC) database at Moffitt Cancer Center, we applied intSKAT to investigate the association between mutations in genes and patients’ OS as a proof-of-principle study. Briefly, for sequencing and variant calling, tumor samples from the TCC project were subjected to genomic capture (performed by BGI, Shenzhen using SureSelect custom designs targeting 1,321 genes, Agilent Technologies, Inc., Santa Clara, CA) and massively parallel sequencing. Sequences were aligned to the hs37d5 human reference with the Burrows-Wheeler Aligner (BWA). Insertion/deletion realignment, quality score recalibration, and variant identification were performed with the Genome Analysis ToolKit (GATK). Sequence variants were annotated with ANNOVAR and custom scripts. We limited variants to those within the 1,321 gene target regions plus 100 flanking base pairs. High quality variants were retained by including only variants with GQ score >=15 and excluding variants in the least specific VQSR Tranche (100.00). Variants were further retained if >=80% of the samples had a high quality genotype call (reference or variant) at that position. Somatic mutations were enriched by removing variants observed >1% in 1000 Genomes, ESP African or ESP European populations. Variants were also removed if observed >1% in a set of 238 normal tissue samples subjected to the same capture and sequencing procedure. Variants were finally filtered to include only protein-altering variants (nonsynonymous, frameshifting or non-frameshifting indels, stopgain, stoploss, and splicing variants) or only protein-altering plus UTR variants.
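
A minimal sketch of the call-rate and population-frequency filters described above is given below. It is not the authors' pipeline: the input layout (a list of per-sample GQ scores and a dictionary of frequencies per reference set) and the function name are assumptions, and the VQSR tranche and annotation filters are omitted.

```python
from typing import Dict, List, Optional


def passes_filters(
    genotype_qualities: List[Optional[int]],
    reference_frequencies: Dict[str, float],
    min_gq: int = 15,
    min_call_rate: float = 0.80,
    max_frequency: float = 0.01,
) -> bool:
    """Apply the call-rate and population-frequency filters to one variant."""
    if not genotype_qualities:
        return False
    # A genotype call counts as high quality when its GQ score is >= min_gq.
    high_quality = sum(1 for gq in genotype_qualities if gq is not None and gq >= min_gq)
    if high_quality / len(genotype_qualities) < min_call_rate:
        return False
    # Drop variants seen at >1% in any reference set (e.g. 1000 Genomes, ESP, normals).
    return all(freq <= max_frequency for freq in reference_frequencies.values())
```

Missing genotype calls are represented as `None` and count against the call rate, which is one reasonable reading of the ">=80% of the samples" rule.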

                                                                                                                                                    In addition to performing intSKAT, we performed robust regression in the discovery phase to further narrow down the non-passenger mutations that can drive disease aggressiveness (Figure 1). For validation, we used the whole exome sequencing data and overall survival information from TCGA (N=211) (Figure 1), with variants limited to the 1,321 gene target regions plus 100 flanking base pairs. We chose to use real-world sequencing data and patients’ survival data to reflect real-world complexity.

                                                                                                                                                    Finally, after identifying our top gene with mutations, we performed cell line experiments to elucidate the potential roles of the mutations in the gene(s). The melanoma cell lines Malme-3M and MeWo were purchased from ATCC. Malme-3M and MeWo cells were cultured in RPMI complete medium with 20% and 10% FBS, respectively. Cells were grown at 37°C in a 5% CO2 humidified atmosphere.

                                                                                                                                                    Three-dimensional spheroid assay
                                                                                                                                                    The three-dimensional melanoma spheroids were prepared using the liquid overlay method. Melanoma cells were added to a 96-well plate coated with agar for 72h. Spheroids were harvested, implanted into collagen I, and left to grow for 72h. Then, spheroids were washed in PBS and treated with Calcein-AM and propidium iodide for 1h at 37°C. Afterwards, images were taken using a Nikon-300 inverted fluorescence microscope. The percentage of invasion was determined using ImageJ software. siEPHA7 knockdown experiments were performed to investigate the effect of EPHA7 on invasion.

                                                                                                                                                    Inverse Matrigel invasion assay
                                                                                                                                                    The inverse Matrigel invasion assay was performed as follows. Matrigel was prepared 1:1 in ice cold PBS, inserted in 8 micron pore, 6.5 mm diameter uncoated Transwells placed in the wells of a 24 well tissue culture plate, and incubated for 30 min at 37°C. Cell suspensions (1 x 10^5/ml) were added onto the upward facing underside of the filter and incubated in the inverted state for 4 hours. Each transwell was washed in serum free medium, and 100 μl of RPMI with 10% FBS was added into the transwell and incubated for 72h at 37°C. The cells were fixed in 1 ml of 4% para-formaldehyde/0.2% Triton-X 100 and stained with 1 ml of 4 μM Calcein AM solution for 1h at room temperature. The images were obtained by confocal microscopy. siEPHA7 knockdown experiments were performed to investigate the effect of EPHA7 on invasion.

                                                                                                                                                    RESULTS & DISCUSSION
                                                                                                                                                    A total of 22,848 variants were identified in 1,345 genes, 24 of which were genes near the targeted 1,321 genes. In the discovery phase, 12 genes had minFDR < 10^-3. Among these, 6 genes had standardized residuals greater than 2: ADAMTS18, DNAH8, EPHA7, LRP1B, MUC16, and TTN (p<0.008). We are in the process of downloading and processing the TCGA data for a formal validation analysis. As a quick validation, we looked up the association between mutations and OS using cBioPortal. Among the 6 genes, 3 of the 6 validated (p <0.05) using this initial quick lookup through cBioPortal were: EPHA7 (P Burden = 1.47x10^-7 for TCC; p log-rank test = 0.03 for TCGA) and MUC16 (P Burden = 2.23x10^-6 for TCC; p log-rank test = 0.015 for TCGA). TTN (P Burden = 1.22x10^-6 for TCC) showed a similar trend in TCGA, but with p = 0.07 using the log-rank test. The Malme-3M and MeWo cell lines contain some mutations. siEPHA7 knockdown experiments using both 3-D spheroid assays and inverse Matrigel invasion assays showed that EPHA7 knockdown significantly reduced cell invasion, by 40% (p<0.01) and by 53.4% (p<0.001), respectively. The striking impact of EPHA7 knockdown on both cell survival and cell invasion showed that EPHA7 likely plays a major role as a regulator of metastasis.
                                                                                                                                                    Identifying EPHA7 as a gene with important non-passenger mutations demonstrates the power of jointly evaluating the association between infrequent mutations in a gene and patients’ clinical outcomes.

                                                                                                                                                    CONCLUSIONS
                                                                                                                                                    Through our three-phase melanoma study, we have demonstrated that our proposed integrated approach, combining intSKAT and robust regression, can successfully identify novel clinically impactful genes with mutations. Our proposed method should be readily applicable to discover novel mutations for other cancer types and provide potentially important strategies for personalized treatment options.



                                                                                                                                                    FIGURE 1. A three-phase study design to test our proposed method intSKAT, an integrated approach to discover novel clinically impactful mutations in melanoma patients.

                                                                                                                                                    REFERENCES

                                                                                                                                                    1. Wu, M.C., et al., Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet, 2011. 89(1): p. 82-93.

                                                                                                                                                    TP027 (PT) - Covariation Is a Poor Measure of Molecular Coevolution
                                                                                                                                                    Date: Sunday, July 10 2:40 pm - 3:00 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: PROTEINS
                                                                                                                                                    • David Talavera, University of Manchester, United Kingdom
                                                                                                                                                    • Simon Lovell, University of Manchester, United Kingdom
                                                                                                                                                    • Simon Whelan, Uppsala University, Sweden

                                                                                                                                                    Area Session Chair: Jianlin Cheng

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Covariation of amino-acid residues is widely studied for applications such as protein structure prediction, protein design and analysis of protein-protein interactions. However, there is no consensus as to the underlying evolutionary mechanisms that give rise to covariation. We have developed a theoretical model with the aim of understanding the origins of covariation. Our model predicts that covariation is generated only if strong selective pressure is present for extremely long periods of time. Our empirical analyses confirm this expectation as we demonstrate 1) that covariation methods select pairs of residues with slow evolutionary rates; and, 2) that the location of conserved residues in the core of the protein structure explains the precision of these methods at finding residues in close proximity. Altogether, our results explain the relative performance and limitations of current covariation methods, and the difficulties for developing evolutionary models for detecting coevolution.

                                                                                                                                                    TP028 (PT) - Quantitative analysis of microRNA mediated regulation on competing endogenous RNAs
                                                                                                                                                    Date: Sunday, July 10 3:30 pm - 3:50 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: SYSTEMS / GENES
                                                                                                                                                    • Ye Yuan, Bioinformatics Division, Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, China
                                                                                                                                                    • Bing Liu, Bioinformatics Division, Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, China
                                                                                                                                                    • Peng Xie, Bioinformatics Division, Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, China
                                                                                                                                                    • Michael Zhang, Department of Molecular and Cell Biology, Center for Systems Biology, University of Texas, Dallas, United States
                                                                                                                                                    • Yanda Li, Bioinformatics Division, Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, China
                                                                                                                                                    • Zhen Xie, Bioinformatics Division, Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, China
                                                                                                                                                    • Xiaowo Wang, Bioinformatics Division, Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, China

                                                                                                                                                    Area Session Chair: Hagit Shatkay

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Each microRNA species can bind various types of target RNAs. Therefore, target RNAs could indirectly regulate each other by sequestering shared microRNAs. This phenomenon is called the competing endogenous RNA (ceRNA) effect. The off-target phenomenon in RNAi technology is also closely related to this effect. Combining systems biology modeling analysis and synthetic biology experiments, we established a mathematical model to describe microRNA regulation and built corresponding synthetic gene circuits in cultured human cells to quantify the ceRNA effect under variable conditions. The results suggest that the ceRNA effect is affected by the abundance of microRNA and targets, the number and affinity of binding sites, and the mRNA degradation pathway determined by the degree of microRNA-mRNA complementarity. Furthermore, a non-reciprocal competing effect between microRNA and RNAi was also demonstrated, providing a new direction for the improvement of RNAi technology.

                                                                                                                                                    TP029 (PT) - A Weighted Exact Test for Significance of Mutually Exclusive Mutations in Cancer
                                                                                                                                                    Date: Sunday, July 10 3:30 pm - 3:50 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: DISEASE / GENES
                                                                                                                                                    • Mark Leiserson, Brown University, United States
                                                                                                                                                    • Matthew Reyna, Brown University, United States
                                                                                                                                                    • Benjamin Raphael, Brown University, United States

                                                                                                                                                    Area Session Chair: Paul Horton

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Large-scale cancer sequencing efforts over the past decade from consortia such as The Cancer Genome Atlas have revealed that different combinations of mutations cause cancer in different patients. One method for distinguishing the driver mutations responsible for cancer from the random mutations with no role in cancer is to search for combinations of mutations that are mutually exclusive across tumors. We introduce a new statistical test for mutual exclusivity that uses the observed number of mutations in genes and tumor samples. The statistical test weights mutations with per gene, per sample mutation probabilities. We present a formula for computing this test exactly, and derive an approximation that can compute the tail probability quickly and accurately. We demonstrate our approach by applying it to hundreds of colorectal, thyroid, and endometrial cancers.
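
To make the idea of weighting by per gene, per sample mutation probabilities concrete, the sketch below computes an exact tail probability for the number of samples in which exactly one gene of a set is mutated. It is a simplified analogue and not the test presented in the abstract: it treats mutations as independent Bernoulli events and does not condition on the observed number of mutations per gene. All names and probabilities are assumptions for illustration.

```python
from typing import List


def exclusive_probability(probs: List[float]) -> float:
    """P(exactly one gene in the set is mutated) in one sample."""
    total = 0.0
    for i, p in enumerate(probs):
        term = p
        for k, q in enumerate(probs):
            if k != i:
                term *= 1.0 - q  # every other gene must be unmutated
        total += term
    return total


def tail_probability(sample_probs: List[List[float]], observed: int) -> float:
    """P(T >= observed), where T counts samples with exactly one mutated gene.

    T is a sum of independent Bernoulli variables with different success
    probabilities, so its distribution is built by dynamic programming.
    """
    dist = [1.0]  # dist[t] = P(T == t) over the samples processed so far
    for probs in sample_probs:
        q = exclusive_probability(probs)
        nxt = [0.0] * (len(dist) + 1)
        for t, mass in enumerate(dist):
            nxt[t] += mass * (1.0 - q)
            nxt[t + 1] += mass * q
        dist = nxt
    return sum(dist[observed:])
```

Here `sample_probs[j][i]` is the assumed probability that gene `i` is mutated in sample `j`. The dynamic program is quadratic in the number of samples, which hints at why a fast approximation of the tail is attractive for larger cohorts.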

                                                                                                                                                    TP030 (PT) - CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction
                                                                                                                                                    Date: Sunday, July 10 3:30 pm - 3:50 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: PROTEINS
                                                                                                                                                    • Xuefeng Cui, KAUST, Saudi Arabia
                                                                                                                                                    • Zhiwu Lu, Renmin University, China
                                                                                                                                                    • Sheng Wang, Toyota Technological Institute at Chicago, United States
                                                                                                                                                    • Jingyan Wang, KAUST, Saudi Arabia
                                                                                                                                                    • Xin Gao, King Abdullah University of Science and Technology, Saudi Arabia

                                                                                                                                                    Area Session Chair: Jianlin Cheng

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: Protein homology detection, a fundamental problem in computational biology, is an indispensable step towards predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading, and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information.

                                                                                                                                                    Method: We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence-structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration.

                                                                                                                                                    Results: We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8,332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM-HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods.

                                                                                                                                                    TP031 (PT) - Reconstructing the temporal progression of HIV-1 immune response pathways
                                                                                                                                                    Date: Sunday, July 10 3:50 pm - 4:10 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: SYSTEMS / DISEASE
                                                                                                                                                    • Siddhartha Jain, Carnegie Mellon University, United States
                                                                                                                                                    • Joel Arrais, Universidade de Aveiro, IEETA, Portugal
                                                                                                                                                    • Narasimhan J. Venkatachari, University of Pittsburgh, United States
                                                                                                                                                    • Velpandi Ayyavoo, University of Pittsburgh, United States
                                                                                                                                                    • Ziv Bar-Joseph, Carnegie Mellon University, United States

                                                                                                                                                    Area Session Chair: Hagit Shatkay

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    We present TimePath, a new method that integrates time series and static datasets to reconstruct dynamic models of host response to stimulus. TimePath uses an Integer Programming formulation to select a subset of pathways that, together, explain the observed dynamic responses. Applying TimePath to study human response to HIV-1 led to accurate reconstruction of several known regulatory and signaling pathways and to novel mechanistic insights. We experimentally validated several of TimePath's predictions, highlighting the usefulness of temporal models.

                                                                                                                                                    TP032 (PT) - Clonal evolution inference and visualization in metastatic colorectal cancer
                                                                                                                                                    Date: Sunday, July 10 3:50 pm - 4:10 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: DISEASE / GENES
                                                                                                                                                    • Ha X. Dang, Washington University in St. Louis, United States
                                                                                                                                                    • Julie Grossman, Washington University in St. Louis, United States
                                                                                                                                                    • Brian White, Washington University in St. Louis, United States
                                                                                                                                                    • Steven Foltz, Washington University in St. Louis, United States
                                                                                                                                                    • Christopher Miller, Washington University in St. Louis, United States
                                                                                                                                                    • Jingqin Luo, Washington University in St. Louis, United States
                                                                                                                                                    • Timothy Ley, Washington University in St. Louis, United States
                                                                                                                                                    • Richard Wilson, Washington University in St. Louis, United States
                                                                                                                                                    • Elaine Mardis, Washington University in St. Louis, United States
                                                                                                                                                    • Ryan Fields, Washington University in St. Louis, United States
                                                                                                                                                    • Christopher Maher, Washington University in St. Louis, United States

                                                                                                                                                    Area Session Chair: Paul Horton

                                                                                                                                                    Presentation Overview: Show


                                                                                                                                                    Dissecting genomic heterogeneity and clonal evolution in tumors is critical to understanding cancer progression, metastasis, and recurrence. To identify subclonal populations of cancer cells, somatic variants identified via sequencing are often clustered across tumor samples based on their variant allele frequencies (VAF) or cancer cell fractions (CCF). We developed ClonEvol, a tool to infer and visualize clonal evolution models in multiple related tumor samples using pre-clustered variants. We demonstrated that ClonEvol was able to infer clonal evolution models using published and simulated datasets. We also used ClonEvol to infer clonal evolution models for an unpublished dataset of whole genome/exome and targeted sequencing of multi-organ, multi-region primary and metastatic tumors from a metastatic colorectal cancer cohort. We discovered that metastasis seeding in colorectal cancers was a complex event that involved multiple subclones from primary and metastatic tumors. Moreover, the critical subclones that drove metastasis were often missed when a single biopsy was sequenced from the primary tumors, which necessitates multi-region sequencing for monitoring clonal evolution and identifying critical events driving metastasis. ClonEvol is available at https://github.com/hdng/clonevol

                                                                                                                                                    TP033 (PT) - Ensemble-Based Evaluation for Protein Structure Models
                                                                                                                                                    Date: Sunday, July 10 3:50 pm - 4:10 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: PROTEINS
                                                                                                                                                    • Michal Jamroz, Warsaw University, Poland
                                                                                                                                                    • Andrzej Kolinski, Warsaw University, Poland
                                                                                                                                                    • Daisuke Kihara, Purdue University, United States

                                                                                                                                                    Area Session Chair: Jianlin Cheng

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Motivation: Comparing protein tertiary structures is a fundamental procedure in structural biology and protein bioinformatics. Structure comparison is particularly important for evaluating computational protein structure models. Most model structure evaluation methods perform rigid body superimposition of a structure model onto its crystal structure and measure the difference between the corresponding residue or atom positions. However, these methods neglect the intrinsic flexibility of proteins by treating the native structure as a rigid molecule. Since different parts of proteins have different levels of flexibility (for example, exposed loop regions are usually more flexible than the core region of a protein structure), disagreement between a model and the native structure needs to be evaluated differently depending on the flexibility of residues in a protein.
                                                                                                                                                    Results: We propose a score named FlexScore for comparing protein structures that considers the flexibility of each residue in the native state of proteins. Flexibility information may be extracted from experiments such as NMR or from molecular dynamics simulation. FlexScore considers an ensemble of conformations of a protein described as a multivariate Gaussian distribution of atomic displacements and compares a query computational model to the ensemble. We compare FlexScore with other commonly used structure similarity scores over various examples. FlexScore agrees with experts’ intuitive assessment of computational models and provides information on the practical usefulness of models.
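
The idea of weighting model errors by residue flexibility can be illustrated with a minimal sketch. This is not the FlexScore formula, which the abstract does not give: it assumes a pre-superimposed ensemble, replaces the multivariate Gaussian with a simplified isotropic (single-variance) Gaussian per residue, and uses hypothetical function names and a hypothetical variance floor.

```python
from statistics import mean


def per_residue_stats(ensemble):
    """Mean position and isotropic variance per residue.

    `ensemble` is a list of conformations, each a list of (x, y, z)
    tuples, one per residue, assumed to be already superimposed.
    """
    n_res = len(ensemble[0])
    means, variances = [], []
    for i in range(n_res):
        centre = tuple(mean(conf[i][k] for conf in ensemble) for k in range(3))
        # Mean squared distance of the residue from its ensemble centre.
        var = mean(
            sum((conf[i][k] - centre[k]) ** 2 for k in range(3))
            for conf in ensemble
        )
        means.append(centre)
        variances.append(var)
    return means, variances


def flexibility_weighted_deviation(model, means, variances, floor=0.01):
    """Average squared deviation of `model` from the ensemble mean,
    scaled per residue by the ensemble variance (hypothetical score).

    `floor` is an assumed lower bound that avoids division by zero
    for perfectly rigid residues.
    """
    total = 0.0
    for pos, centre, var in zip(model, means, variances):
        sq_dist = sum((pos[k] - centre[k]) ** 2 for k in range(3))
        total += sq_dist / max(var, floor)
    return total / len(model)


# Toy ensemble: residue 0 is rigid, residue 1 moves along y.
ensemble = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (5.0, 2.0, 0.0)],
    [(0.0, 0.0, 0.0), (5.0, -2.0, 0.0)],
]
means, variances = per_residue_stats(ensemble)

# Both models are off by 1 unit, but at residues of different flexibility.
error_at_rigid = [(1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
error_at_flexible = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
score_rigid = flexibility_weighted_deviation(error_at_rigid, means, variances)
score_flexible = flexibility_weighted_deviation(error_at_flexible, means, variances)
```

In this toy case the same displacement is penalized far more heavily at the rigid residue than at the flexible one, which is the behaviour the abstract motivates.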

                                                                                                                                                    TP034 (PT) - Identification of essential molecular and cellular processes controlling the response time and intensity of inflammation
                                                                                                                                                    Date: Sunday, July 10 4:10 pm - 4:30 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: SYSTEMS / DISEASE
                                                                                                                                                    • Alexander Mitrophanov, Department of Defense Biotechnology High Performance Computing Software Applications Institute, United States
                                                                                                                                                    • Sridevi Nagaraja, Department of Defense Biotechnology High Performance Computing Software Applications Institute, United States
                                                                                                                                                    • Jaques Reifman, Department of Defense Biotechnology High Performance Computing Software Applications Institute, United States

                                                                                                                                                    Area Session Chair: Hagit Shatkay

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Pathological inflammation, including inflammatory response with exaggerated intensity (sepsis) or with delayed resolution (chronic inflammation), has defied attempts at efficacious treatment. Here, we developed and applied a computational strategy to demonstrate how specific molecular and cellular components can be manipulated to achieve targeted modulation of the inflammatory response time and intensity. The strategy was based on comprehensive sensitivity and correlation analyses using our recently developed kinetic model that can represent thousands of possible inflammation scenarios. We identified three molecular mediators whose inhibition may robustly restore pathological inflammation to its normal course. We found that inflammation timing was more difficult to control than its intensity. Yet, simultaneous inhibition of two distinct targets suggested a reliable means to normalize both excessively strong and abnormally prolonged inflammatory responses. Our model was validated with existing experimental data and suggested new in vivo experiments.

                                                                                                                                                    TP035 (PT) - Robust discrimination of cell types from tissue expression profiles
                                                                                                                                                    Date: Sunday, July 10 4:10 pm - 4:30 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: DISEASE / DATA
                                                                                                                                                    • Aaron M. Newman, Stanford University, United States
                                                                                                                                                    • Andrew J. Gentles, Stanford University, United States
                                                                                                                                                    • Chih Long Liu, Stanford University, United States
                                                                                                                                                    • Michael R. Green, University of Nebraska Medical Center, United States
                                                                                                                                                    • Weiguo Feng, Stanford University, United States
                                                                                                                                                    • Scott V. Bratman, University of Toronto, Canada
                                                                                                                                                    • Dongkyoon Kim, Stanford University, United States
                                                                                                                                                    • Yue Xu, Stanford University, United States
                                                                                                                                                    • Amanda Khuong, Stanford University, United States
                                                                                                                                                    • Chuong D. Hoang, National Cancer Institute, United States
                                                                                                                                                    • Viswam S. Nair, Stanford University, United States
                                                                                                                                                    • Robert B. West, Stanford University, United States
                                                                                                                                                    • Sylvia K. Plevritis, Stanford University, United States
                                                                                                                                                    • Maximilian Diehn, Stanford University, United States
                                                                                                                                                    • Ash A. Alizadeh, Stanford University, United States

                                                                                                                                                    Area Session Chair: Paul Horton

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Changes in cellular composition underlie diverse physiological states. While flow cytometry and immunohistochemistry are commonly used to characterize tissue heterogeneity, the former requires cell dissociation, which can alter representation, while the latter is generally limited to one marker per section. To complement these methods, we developed CIBERSORT, an in silico deconvolution approach that robustly enumerates cell subsets of interest from gene expression profiles (GEPs) of bulk tissues. We evaluated CIBERSORT using fresh, frozen, and fixed specimens, including solid tumors, and found that it outperforms previous deconvolution methods with respect to noise, unknown mixture content, and closely related cell types. When applied to GEPs from 25 tumor types in a pan-cancer analysis, CIBERSORT revealed complex associations between 22 tumor-infiltrating leukocyte subsets and clinical outcomes. Predictions linking specific immune phenotypes to survival were validated in lung adenocarcinoma. CIBERSORT provides a novel platform for tissue characterization without requiring antibodies, disaggregation, or living cells.

                                                                                                                                                    TP036 (PT) - Investigating molecular determinants of ebolavirus pathogenicity
                                                                                                                                                    Date: Sunday, July 10 4:10 pm - 4:30 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DISEASE / PROTEINS
                                                                                                                                                    • Morena Pappalardo, University of Kent, United Kingdom
                                                                                                                                                    • Miguel Juliá, University of Kent, United Kingdom
                                                                                                                                                    • Mark Howard, University of Kent, United Kingdom
                                                                                                                                                    • Jeremy Rossman, University of Kent, United Kingdom
                                                                                                                                                    • Martin Michaelis, University of Kent, United Kingdom
                                                                                                                                                    • Mark Wass, University of Kent, United Kingdom

                                                                                                                                                    Area Session Chair: Jianlin Cheng

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    The West Africa Ebola virus outbreak has killed thousands of people and demonstrated the scale on which the virus threatens human life. Using extensive sequencing data obtained during the outbreak, we compare Ebolavirus genomes to identify potential molecular determinants of Ebolavirus pathogenicity. Of the five Ebolavirus species, only Reston viruses are not pathogenic in humans. We compared the Reston virus genome with those from the four human pathogenic species to identify specificity determining positions (SDPs) that are differentially conserved and may therefore act as molecular determinants of pathogenicity. We initially identified 189 SDPs using 196 Ebolavirus genome sequences. We report a reduced number of SDPs using a much larger set of sequences from the current outbreak. Structural analysis was performed to identify SDPs that are likely to alter protein structure and function and could be associated with pathogenicity. The most striking findings were in Ebolavirus proteins VP24 and VP40. In particular, SDPs present in VP24 are likely to impair binding to human karyopherin alpha proteins and therefore prevent inhibition of interferon signaling in response to viral infection. VP24 is also critical for Ebolavirus adaptation to novel hosts, and as only a few SDPs distinguish Reston virus VP24 from VP24 of other Ebolaviruses, it is possible that human pathogenic Reston viruses may emerge.
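
The notion of a differentially conserved position can be illustrated with a minimal sketch. It assumes two groups of pre-aligned, equal-length sequences and a strict criterion (fully conserved within each group, different between groups). The abstract does not specify the authors' actual SDP criterion, so the function name, the rule, and the toy sequences below are all hypothetical.

```python
def find_sdps(group_a, group_b, gap="-"):
    """Return 0-based alignment columns that are fully conserved within
    each group but differ between the two groups (simplified rule)."""
    length = len(group_a[0])
    if any(len(seq) != length for seq in group_a + group_b):
        raise ValueError("all sequences must be aligned to the same length")

    sdps = []
    for col in range(length):
        residues_a = {seq[col] for seq in group_a}
        residues_b = {seq[col] for seq in group_b}
        conserved = len(residues_a) == 1 and len(residues_b) == 1
        has_gap = gap in residues_a or gap in residues_b
        if conserved and not has_gap and residues_a != residues_b:
            sdps.append(col)
    return sdps


# Toy aligned fragments, not real Ebolavirus sequences.
non_pathogenic = ["MKTAY", "MKTAY"]
pathogenic = ["MRTAF", "MRTGF"]
sdp_columns = find_sdps(non_pathogenic, pathogenic)
```

Here columns 1 and 4 qualify, while column 3 does not because the second group is not conserved there. A practical analysis would likely tolerate some within-group variation rather than require strict conservation.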

                                                                                                                                                    TP037 (PT) - LINEs between species: Evolutionary dynamics of LINE-1 retrotransposons across the eukaryotic tree of life
                                                                                                                                                    Date: Monday, July 11 10:10 am - 10:30 am
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Atma Ivancevic, The University of Adelaide, Australia
                                                                                                                                                    • Dan Kortschak, The University of Adelaide, Australia
                                                                                                                                                    • Terry Bertozzi, South Australian Museum, Australia
                                                                                                                                                    • David Adelson, University of Adelaide, Australia

                                                                                                                                                    Area Session Chair: Yana Bromberg

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    LINE-1 (L1) retrotransposons are dynamic elements. They have the potential to cause great genomic change by inserting copies of themselves throughout the genome, resulting in the duplication and rearrangement of regulatory DNA. Active L1, in particular, are often thought of as tightly constrained, homologous and ubiquitous elements with well-characterised domain organisation. For the past 30 years, model organisms have been used to define L1s as 6-8kb sequences containing a 5’-UTR, two open reading frames working harmoniously in cis, and a 3’-UTR with a polyA tail.
                                                                                                                                                    In this study, we demonstrate the remarkable and overlooked diversity of L1s via a comprehensive phylogenetic analysis of over 500 species from widely divergent branches of the tree of life. The rapid and recent growth of L1 elements in mammalian species is juxtaposed against their decline in plant species and complete extinction in most reptiles and insects. In fact, some of these previously unexplored mammalian species (e.g. snub-nosed monkey, minke whale) exhibit L1 retrotranspositional ‘hyperactivity’ far surpassing that of human or mouse. In contrast, non-mammalian L1s have become so varied that the current classification system seems to inadequately capture their structural characteristics. Our findings illustrate how both long-term inherited evolutionary patterns and random bursts of activity in individual species can significantly alter genomes, highlighting the importance of L1 dynamics in eukaryotes.

                                                                                                                                                    TP038 (PT) - Convolutional neural network architectures for predicting DNA-protein binding
                                                                                                                                                    Date: Monday, July 11 10:10 am - 10:30 am
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: DATA / PROTEINS
                                                                                                                                                    • Haoyang Zeng, Massachusetts Institute of Technology, United States
                                                                                                                                                    • Matthew Edwards, MIT, United States
                                                                                                                                                    • Ge Liu, MIT, United States
                                                                                                                                                    • David Gifford, MIT, United States

                                                                                                                                                    Area Session Chair: Bruno Gaeta

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Convolutional neural networks (CNNs)
                                                                                                                                                    have outperformed conventional methods in modeling the sequence
                                                                                                                                                    specificity of DNA-protein binding. Yet inappropriate CNN
                                                                                                                                                    architectures can yield poorer performance than simpler models. Thus
                                                                                                                                                    an in-depth understanding of how to match CNN architecture to a
                                                                                                                                                    given task is needed to fully harness the power of CNNs for
                                                                                                                                                    computational biology applications. We present
                                                                                                                                                    a systematic exploration of CNN architectures for predicting DNA
                                                                                                                                                    sequence binding using a large compendium of transcription factor
                                                                                                                                                    datasets. We identify the best-performing architectures by varying
                                                                                                                                                    CNN width, depth, and pooling designs. We find that adding
                                                                                                                                                    convolutional kernels to a network is important for motif-based
                                                                                                                                                    tasks. We show the benefits of CNNs in learning rich higher-order
                                                                                                                                                    sequence features, such as secondary motifs and local sequence
                                                                                                                                                    context, by comparing network performance on multiple modeling tasks
                                                                                                                                                    ranging in difficulty. We also demonstrate how careful construction
                                                                                                                                                    of sequence benchmark datasets, using approaches that control
                                                                                                                                                    potentially confounding effects like positional or motif strength
                                                                                                                                                    bias, is critical in making fair comparisons between competing
                                                                                                                                                    methods. We explore how to establish the sufficiency of training
                                                                                                                                                    data for these learning tasks, and we have created a flexible
                                                                                                                                                    cloud-based framework that permits the rapid exploration of
                                                                                                                                                    alternative neural network architectures for problems in
                                                                                                                                                    computational biology.
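To make the architectural terms in this abstract concrete, the following is a minimal sketch of a single convolutional layer applied to one-hot encoded DNA, followed by ReLU and global max pooling. It is an illustration only, not the authors' framework: the function names, the hand-built "TATA" kernel, and the bias value are assumptions chosen for the example, and a real network would learn its kernels from data.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases stay all-zero."""
    encoded = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        j = ALPHABET.find(base)
        if j >= 0:
            encoded[i, j] = 1.0
    return encoded


def conv_max_pool(x, kernels, bias):
    """Slide each kernel along the sequence, apply ReLU, then global max-pool.

    x: (L, 4) one-hot sequence, with L >= kernel width
    kernels: (n_kernels, width, 4); bias: (n_kernels,)
    Returns one pooled activation per kernel.
    """
    n_kernels, width, _ = kernels.shape
    n_positions = x.shape[0] - width + 1
    activations = np.zeros((n_kernels, n_positions))
    for p in range(n_positions):
        window = x[p:p + width]  # (width, 4) slice of the sequence
        # Sum over the width and base axes: one dot product per kernel.
        activations[:, p] = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    activations = np.maximum(activations, 0.0)  # ReLU
    return activations.max(axis=1)              # global max pooling


# Hypothetical kernel that responds only to an exact "TATA" match.
tata_kernel = one_hot("TATA")[np.newaxis, :, :]
bias = np.array([-3.0])
features = conv_max_pool(one_hot("GGCTATAAGG"), tata_kernel, bias)
```

In this sketch, "width" corresponds to the number of kernels (the first axis of `kernels`), while "depth" would correspond to stacking further layers on top of the pooled or unpooled activations.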

                                                                                                                                                    TP039 (PT) - What Time is It? Deep Learning Approaches for Circadian Rhythms
                                                                                                                                                    Date: Monday, July 11 10:10 am - 10:30 am
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: SYSTEMS / GENES
                                                                                                                                                    • Forest Agostinelli, University of California-Irvine, United States
                                                                                                                                                    • Nicholas Ceglia, University of California-Irvine, United States
                                                                                                                                                    • Babak Shahbaba, University of California-Irvine, United States
                                                                                                                                                    • Paolo Sassone-Corsi, University of California-Irvine, United States
                                                                                                                                                    • Pierre Baldi, University of California-Irvine, United States

                                                                                                                                                    Area Session Chair: Nicola Mulder

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: Circadian rhythms date back to the origins of life, are found in virtually every species and every cell, and play fundamental roles in functions ranging from metabolism to cognition. Modern high-throughput technologies allow the measurement of concentrations of transcripts, metabolites, and other species along the circadian cycle, creating novel computational challenges and opportunities, including the problems of inferring whether a given species oscillates in circadian fashion or not, and inferring the time at which a set of measurements was taken.

                                                                                                                                                    Results: We first curate several large synthetic and biological time series data sets containing labels for both periodic and aperiodic signals. We then use deep learning methods to develop and train BIO_CYCLE, a system to robustly estimate which signals are periodic in high-throughput circadian experiments, producing estimates of amplitudes, periods, phases, as well as several statistical significance measures. Using the curated data, BIO_CYCLE is compared to other approaches and shown to achieve state-of-the-art performance across multiple metrics. We then use deep learning methods to develop and train BIO_CLOCK to robustly estimate the time at which a particular single-time-point transcriptomic experiment was carried out. In most cases, BIO_CLOCK can reliably predict time, within approximately one hour, using the expression levels of only a small number of core clock genes.
                                                                                                                                                    BIO_CLOCK is shown to work reasonably well across tissue types, and often with only small degradation across conditions. BIO_CLOCK is used to annotate most mouse experiments found in the GEO database with an inferred time stamp.

                                                                                                                                                    Availability: All data and software are publicly available on the CircadiOmics web portal: circadiomics.igb.uci.edu/.

                                                                                                                                                    TP040 (PT) - phRAIDER: Pattern-Hunter Based Rapid Ab Initio Detection of Elementary Repeats
                                                                                                                                                    Date: Monday, July 11 10:30 am - 10:50 am
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Charlotte Schaeffer, Miami University, United States
                                                                                                                                                    • Nathan Figueroa, Miami University, United States
                                                                                                                                                    • Xiaolin Liu, Miami University (Ohio), United States
                                                                                                                                                    • John Karro, Miami University (Ohio), United States

                                                                                                                                                    Area Session Chair: Yana Bromberg

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: Transposable elements and repetitive DNA make up a sizable fraction of eukaryotic genomes, and their annotation is crucial to the study of the structure, organization, and evolution of any newly sequenced genome. While RepeatMasker and nHMMER are useful for identifying these repeats, they require a pre-compiled repeat library, which is not always available. De novo tools such as Recon, RepeatScout, or RepeatGluer serve to identify TEs purely from sequence content, but are either limited by runtimes that prohibit whole-genome use or degrade in quality in the presence of substitutions that disrupt the sequence patterns.

                                                                                                                                                    Results: phRAIDER is a de novo transposable element identification tool that addresses the issue of runtime without sacrificing sensitivity relative to competing tools. The underlying model is a new definition of elementary repeats that incorporates the PatternHunter spaced seed model, allowing for greater sensitivity in the presence of genomic substitutions. Compared to the premier tool in the literature, RepeatScout, phRAIDER shows an average 10x speedup on any single human chromosome and can process the whole human genome in just over three hours. Here we present the tool, the theoretical model underlying it, and the results demonstrating its effectiveness.

                                                                                                                                                    Availability: phRAIDER is an open source tool available from https://github.com/karroje/phRAIDER.
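The spaced seed idea mentioned in this abstract can be sketched briefly. A spaced seed is a binary mask in which '1' marks positions that must match and '0' marks positions that are ignored, so two windows can still be grouped together when a substitution falls on a '0' position. The code below is a toy illustration under that definition only; it is not phRAIDER's algorithm, and the function names and the short seed `1101` are assumptions made for the example.

```python
from collections import defaultdict


def spaced_key(window, seed):
    """Keep only the characters at the seed's '1' (must-match) positions."""
    return "".join(c for c, s in zip(window, seed) if s == "1")


def seed_hits(sequence, seed):
    """Group start positions of windows that agree at every '1' position of the seed.

    Returns only the groups that occur more than once, i.e. candidate repeats.
    """
    groups = defaultdict(list)
    for start in range(len(sequence) - len(seed) + 1):
        window = sequence[start:start + len(seed)]
        groups[spaced_key(window, seed)].append(start)
    return {key: positions for key, positions in groups.items() if len(positions) > 1}


# "ACGT" (start 0) and "ACCT" (start 6) differ only at the third base,
# which the hypothetical seed 1101 ignores.
spaced = seed_hits("ACGTTTACCT", "1101")      # {'ACT': [0, 6]}
contiguous = seed_hits("ACGTTTACCT", "1111")  # {} (exact 4-mers never repeat here)
```

The contrast between the two calls shows why a spaced seed tolerates a substitution that breaks an exact contiguous match.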

                                                                                                                                                    TP041 (PT) - RCK: accurate and efficient inference of sequence- and structure-based protein-RNA binding models from RNAcompete data
                                                                                                                                                    Date: Monday, July 11 10:30 am - 10:50 am
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: DATA / GENES
                                                                                                                                                    • Yaron Orenstein, MIT, United States
                                                                                                                                                    • Yuhao Wang, MIT, United States
                                                                                                                                                    • Bonnie Berger, MIT, United States

                                                                                                                                                    Area Session Chair: Bruno Gaeta

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: Protein-RNA interactions, which play vital roles in many processes, are mediated through both RNA sequence and structure. CLIP-based methods, which measure protein-RNA binding in vivo, suffer from experimental noise and systematic biases, whereas in vitro experiments capture a clearer signal of protein-RNA binding. Among them, RNAcompete provides binding affinities of a specific protein to more than 240,000 unstructured RNA probes in one experiment. The computational challenge is to infer RNA structure- and sequence-based binding models from these data. The state of the art in sequence models, DeepBind, does not model structural preferences. RNAcontext models both sequence and structure preferences, but was outperformed by GraphProt. Unfortunately, GraphProt cannot detect structural preferences from RNAcompete data due to the unstructured nature of the data, as noted by its developers.
                                                                                                                                                    Results: We develop RCK, an efficient, scalable algorithm to infer sequence and structure preferences based on a new k-mer model. Remarkably, even though RNAcompete data is designed to be unstructured, RCK can still learn structural preferences from it. RCK significantly outperforms both RNAcontext and DeepBind in in vitro binding prediction for 244 RNAcompete experiments. Moreover, RCK is also faster and uses less memory, which enables scalability. While RCK is currently on par with existing methods in in vivo binding prediction on a small-scale test, we demonstrate that it will increasingly benefit from experimentally measured RNA structure profiles as compared to computationally predicted ones. By running RCK on the entire RNAcompete dataset, we generate and provide as a resource a set of protein-RNA structure-based models on an unprecedented scale.
                                                                                                                                                    Availability: Software and models are freely available at http://groups.csail.mit.edu/cb/rck/.
                                                                                                                                                    Contact: bab@mit.edu
                                                                                                                                                    Supplementary information: Supplementary data are available at Bioinformatics online.

                                                                                                                                                    TP042 (PT) - Core Regulatory Circuitry of the Plant Circadian System
                                                                                                                                                    Date: Monday, July 11 10:30 am - 10:50 am
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: SYSTEMS / GENES
                                                                                                                                                    • Mathias Foo, University of Warwick, United Kingdom
                                                                                                                                                    • David Somers, The Ohio State University, United States
                                                                                                                                                    • Pan-Jun Kim, Asia Pacific Center for Theoretical Physics, Korea, Republic of

                                                                                                                                                    Area Session Chair: Nicola Mulder

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Sleep/wake cycles in animals exemplify daily biological rhythms driven by internal molecular clocks, circadian clocks, which are important for plant life as well. The plant circadian clock is much more complex than that of any other organism, and its design principle has eluded our understanding. Based on the mechanistic modeling and simulation of Arabidopsis thaliana, we successfully identified a kernel of the plant circadian system, the critical gene regulatory circuitry for clock function. The kernel integrates four major negative feedback loops for molecular circadian oscillations. Strikingly, the kernel structure, as well as the whole clock circuitry, was found to be overwhelmingly composed of inhibitory, not activating, interactions among genes. This fact facilitates the global coordination of plant circadian molecular profiles, which often exhibit sharply shaped, cuspidate waveforms that indicate clock events markedly peaked at very specific times of day. Our approach elucidates a design principle of biological clockwork, with implications for synthetic biology.

                                                                                                                                                    TP043 (PT) - DNA editing of LTR retrotransposons reveals the impact of APOBECs on vertebrate genomes
                                                                                                                                                    Date: Monday, July 11 10:50 am - 11:10 am
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Binyamin Knisbacher, The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Israel
                                                                                                                                                    • Erez Levanon, The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Israel

                                                                                                                                                    Area Session Chair: Yana Bromberg

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    LTR retrotransposons are retrovirus-like entities widespread in vertebrate genomes. These replicating endogenous retroviruses (ERVs) must be restricted to prevent deleterious mutations and maintain genome integrity. The APOBEC DNA-editing enzymes can do so by inflicting C-to-U hypermutation in retrotransposon DNA during their mobilization. In some cases, hypermutated retrotransposons successfully integrate into the genome, introducing unique sequences, which increase retrotransposon diversity and the probability of developing new function at the loci of insertion. We developed a computational approach to identify such events, applied it to genomes of 123 diverse species and identified numerous DNA edited sites in humans and various vertebrate lineages. Unexpectedly, DNA editing is exceptionally prevalent in some birds, including one of Darwin's finches. Edited ERVs are enriched in genic regions, thereby raising the probability of their exaptation for novel function. Our results show that DNA editing has a substantial role in vertebrate innate immunity and may accelerate genome evolution.

                                                                                                                                                    TP044 (PT) - Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning
                                                                                                                                                    Date: Monday, July 11 10:50 am - 11:10 am
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: DATA / PROTEINS
                                                                                                                                                    • Hannes Bretschneider, University of Toronto, Canada
                                                                                                                                                    • Brendan Frey, University of Toronto, Canada
                                                                                                                                                    • Andrew Delong, Deep Genomics, Canada
                                                                                                                                                    • Babak Alipanahi, University of Toronto, Canada

                                                                                                                                                    Area Session Chair: Bruno Gaeta

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with ‘deep learning’ techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a ‘mutation map’ that indicates how variations affect binding within a specific sequence.
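The "mutation map" idea described above can be illustrated with a short sketch: score the original sequence, then score every single-base substitution and record the change. The code below assumes nothing about DeepBind itself; `mutation_map` accepts any scoring function, and `toy_score` with its hypothetical motif "GATA" is a stand-in for a trained model, used only so the example runs. It does not reproduce DeepBind's scoring or the scaling used in its published plots.

```python
def mutation_map(seq, score):
    """Return {(position, alt_base): score(mutant) - score(seq)} for every single-base change."""
    reference = score(seq)
    effects = {}
    for i, ref_base in enumerate(seq):
        for alt in "ACGT":
            if alt != ref_base:
                mutant = seq[:i] + alt + seq[i + 1:]
                effects[(i, alt)] = score(mutant) - reference
    return effects


def toy_score(seq, motif="GATA"):
    """Stand-in scorer (an assumption): best count of matching bases to a hypothetical motif.

    Requires len(seq) >= len(motif).
    """
    width = len(motif)
    return max(
        sum(a == b for a, b in zip(seq[i:i + width], motif))
        for i in range(len(seq) - width + 1)
    )


effects = mutation_map("CGATAC", toy_score)
# Changing the G of the motif lowers the score; changing a flanking base does not.
```

Negative entries mark substitutions that weaken the predicted binding, entries at zero mark positions the scorer does not depend on, and positive entries would mark substitutions that strengthen it.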

                                                                                                                                                    TP045 (PT) - A Framework for Integrating Co-expression Networks with GWAS to Prioritize Candidate Genes in Maize
                                                                                                                                                    Date: Monday, July 11 10:50 am - 11:10 am
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: SYSTEMS / GENES
                                                                                                                                                    • Robert Schaefer, University of Minnesota, United States
                                                                                                                                                    • Jean-Michel Michno, University of Minnesota, United States
                                                                                                                                                    • Joseph Jeffers, University of Minnesota, United States
                                                                                                                                                    • Owen Hoekenga, Independent Consultant, United States
                                                                                                                                                    • Brian Dilkes, Purdue University, United States
                                                                                                                                                    • Ivan Baxter, Donald Danforth Plant Science Center / USDA-ARS Plant Genetics Research Unit, United States
                                                                                                                                                    • Chad Myers, University of Minnesota, United States

                                                                                                                                                    Area Session Chair: Nicola Mulder

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Genome-wide association studies (GWAS) have identified thousands of loci linked to hundreds of traits in many different species. However, in many cases, the causal genes and the cellular processes they contribute to remain unknown. This problem is even more pronounced in non-model species where functional annotations are sparse. To address these issues, we developed a computational framework called Camoco (Co-Analysis of Molecular Components) that systematically integrates loci identified by GWAS with gene co-expression networks to identify a focused set of putative causal genes that are coordinately regulated. We demonstrate the utility of our approach on new GWAS in maize, the world’s most produced staple crop. Using our approach, candidate SNPs associated with elemental accumulation in maize kernels were reduced by two orders of magnitude. Our study reveals the importance of gene expression data context, as only root tissue-specific co-expression networks based on gene expression signatures across genotypically diverse individuals were able to provide signal for interpreting GWAS candidate SNPs. Both the software tools we developed and the lessons on integrating GWAS data with co-expression networks generalize to other contexts.

                                                                                                                                                    TP046 (PT) - Read-Based Phasing of Related Individuals
                                                                                                                                                    Date: Monday, July 11 11:40 am - 12:00 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES / SYSTEMS
                                                                                                                                                    • Shilpa Garg, MPI-INF, Germany
                                                                                                                                                    • Marcel Martin, Science for Life Laboratory, Sweden
                                                                                                                                                    • Tobias Marschall, Saarland University / Max Planck Institute for Informatics, Germany

                                                                                                                                                    Area Session Chair: Yana Bromberg

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: Read-based phasing deduces the haplotypes of an individual from sequencing reads that cover multiple variants, while genetic phasing takes only genotypes as input and applies the rules of Mendelian inheritance to infer haplotypes within a pedigree of individuals. Combining both into an approach that uses these two independent sources of information - reads and pedigree - has the potential to deliver results better than each individually.
                                                                                                                                                    Results: We provide a theoretical framework combining read-based phasing with genetic haplotyping, and describe a fixed-parameter algorithm and its implementation for finding an optimal solution. We show that leveraging reads of related individuals jointly in this way yields more phased variants and at a higher accuracy than when phased separately, both in simulated and real data. Coverages as low as 2x for each member of a trio yield haplotypes that are as accurate as when analyzed separately at 15x coverage per individual.

                                                                                                                                                    TP047 (PT) - Revisiting the computational analysis of DNase sequencing
                                                                                                                                                    Date: Monday, July 11 11:40 am - 12:00 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Ivan G. Costa, RWTH Aachen University, Germany
                                                                                                                                                    • Eduardo Gadde Gusmao, RWTH Aachen University, Germany
                                                                                                                                                    • Manuel Allhoff, RWTH Aachen University, Germany
                                                                                                                                                    • Martin Zenke, RWTH Aachen University, Germany

                                                                                                                                                    Area Session Chair: Bruno Gaeta

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    DNase-seq is a powerful technique for genome-wide detection of cell-specific binding sites. Computational footprinting methods, which search for footprint-like DNase I cleavage patterns on the DNA, allow the detection of binding sites at base-pair resolution. There is, however, a debate in the literature on the influence of experimental artifacts, such as DNase I cleavage bias and transcription factor residence time, on computational footprinting methods. We investigated these artifacts in a comprehensive panel of DNase-seq data sets, 10 footprinting methods and 88 transcription factors. Our comparative analysis indicates the advantage of HINT, DNase2TF and PIQ over other footprinting methods. We demonstrate that correcting the DNase-seq signal based on cleavage bias estimation significantly improves the accuracy of computational footprinting. We also propose a score to detect footprints arising from transcription factors with short residence time, as footprints of such factors have low predictive performance.

                                                                                                                                                    TP048 (PT) - Novel Applications of Multi-task Learning and Multiple Output Regression to Multiple Genetic Trait Prediction
                                                                                                                                                    Date: Monday, July 11 11:40 am - 12:00 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: GENES / DATA
                                                                                                                                                    • Dan He, IBM T.J. Watson, United States
                                                                                                                                                    • Laxmi Parida, IBM T J Watson Research Center, United States

                                                                                                                                                    Area Session Chair: Nicola Mulder

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Given a set of biallelic molecular markers, such as SNPs, with genotype values encoded numerically on a collection of plant, animal or human samples, the goal of genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Genetic trait prediction is usually formulated as a linear regression model. In many cases, multiple traits are observed for the same set of samples and markers, and some of these traits may be correlated with each other. Modeling all the traits together may therefore improve the prediction accuracy. In this work, we view the multi-trait prediction problem from a machine learning angle: as either a multi-task learning problem or a multiple output regression problem, depending on whether different traits share the same genotype matrix or not. We then adapted multi-task learning algorithms and multiple output regression algorithms to solve the multi-trait prediction problem, and proposed a few strategies to reduce the least-squares error of the prediction from these algorithms. Our experiments show that modeling multiple traits together could improve the prediction accuracy for correlated traits.
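The multiple output regression setting, in which all traits share one genotype matrix, can be sketched in its simplest form as a closed-form ridge fit with a trait matrix on the right-hand side. This is only an illustration: the ridge penalty, the function names and the simulated data are assumptions, not the algorithms evaluated by the authors. With this plain formulation each trait column is still fitted independently, so it serves as the baseline that methods sharing information between correlated traits aim to improve on.

```python
import numpy as np


def fit_multi_output_ridge(genotypes, traits, ridge_penalty=1.0):
    """Fit all traits at once; returns a (markers x traits) weight matrix."""
    X = np.asarray(genotypes, dtype=float)  # samples x markers, coded 0/1/2
    Y = np.asarray(traits, dtype=float)     # samples x traits
    gram = X.T @ X + ridge_penalty * np.eye(X.shape[1])
    # Solve (X^T X + penalty * I) W = X^T Y for every trait column together.
    return np.linalg.solve(gram, X.T @ Y)


def predict_traits(genotypes, weights):
    """Predict every trait for every sample from the fitted marker effects."""
    return np.asarray(genotypes, dtype=float) @ weights


# Simulated, noise-free example data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(50, 5))
true_weights = rng.normal(size=(5, 2))
traits = genotypes @ true_weights

weights = fit_multi_output_ridge(genotypes, traits, ridge_penalty=1e-8)
predictions = predict_traits(genotypes, weights)
```

With a near-zero penalty and noise-free traits, the fitted weights match the simulated marker effects closely; a larger penalty shrinks them toward zero.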

                                                                                                                                                    TP049 (PT) - An Algorithm for Computing the Gene Tree Probability under the Multispecies Coalescent and its Application in the Inference of Population Tree
                                                                                                                                                    Date: Monday, July 11 12:00 pm - 12:20 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES / SYSTEMS
                                                                                                                                                    • Yufeng Wu, Computer Science and Engineering Department, University of Connecticut, United States

                                                                                                                                                    Area Session Chair: Yana Bromberg

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: A gene tree represents the evolutionary history of gene lineages that originate from multiple related populations. Under the multispecies coalescent model, lineages may coalesce outside the species (population) boundary. Given a species tree (with branch lengths), the gene tree probability is the probability of observing a specific gene tree topology under the multispecies coalescent model. There are two existing algorithms for computing the exact gene tree probability. The first algorithm is due to Degnan and Salter (2005), who enumerate all the so-called coalescent histories for the given species tree and the gene tree topology. Their algorithm runs in exponential time in the number of gene lineages in general. The second algorithm is the STELLS algorithm (2012), which is usually faster but also runs in exponential time in almost all cases.

                                                                                                                                                    Results: In this paper, we present a new algorithm, called CompactCH, for computing the exact gene tree probability. This new algorithm is based on the notion of compact coalescent histories: multiple coalescent histories are represented by a single compact coalescent history. The key advantage of our new algorithm is that it runs in polynomial time in the number of gene lineages if the number of populations is fixed to be a constant. The new algorithm is more efficient than the STELLS algorithm both in theory and in practice when the number of populations is small and there are multiple gene lineages from each population. As an application, we show that CompactCH can be applied in the inference of population trees (i.e. the population divergence history) from population haplotypes. Simulation results show that the CompactCH algorithm enables efficient and accurate inference of population trees with many more haplotypes than a previous approach.

                                                                                                                                                    Availability: The CompactCH algorithm is implemented in the
                                                                                                                                                    STELLS software package, which is available for download at http:
                                                                                                                                                    //www.engr.uconn.edu/~ywu/STELLS.html.

                                                                                                                                                    Contact: ywu@engr.uconn.edu
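To make the notion of gene tree probability concrete, the smallest non-trivial case can be computed by hand: a species tree ((A,B),C) with one sampled lineage per species and an internal branch of length T in coalescent units. The sketch below uses the standard closed-form result for this three-taxon case; it is not the CompactCH algorithm, and the function name is an assumption for illustration.

```python
import math


def three_taxon_gene_tree_probabilities(internal_branch_length: float) -> dict:
    """Gene tree topology probabilities for species tree ((A,B),C).

    Assumes one lineage per species and a branch length in coalescent units.
    The A and B lineages fail to coalesce on the internal branch with
    probability exp(-T); in that case all three topologies are equally likely.
    """
    if internal_branch_length < 0:
        raise ValueError("branch length must be non-negative")
    no_coalescence = math.exp(-internal_branch_length)
    return {
        # Gene tree matches the species tree: ((A,B),C).
        "concordant": 1.0 - (2.0 / 3.0) * no_coalescence,
        # Each of the two discordant topologies: ((A,C),B) and ((B,C),A).
        "discordant_each": no_coalescence / 3.0,
    }


probabilities = three_taxon_gene_tree_probabilities(1.0)
total = probabilities["concordant"] + 2 * probabilities["discordant_each"]
```

For larger trees with several lineages per population, the number of coalescent histories grows quickly, which is the cost that the algorithms discussed in the abstract address.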

                                                                                                                                                    TP050 (PT) - The Role of Genome Accessibility in Transcription Factor Binding in Bacteria
                                                                                                                                                    Date: Monday, July 11 12:00 pm - 12:20 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: GENES / PROTEINS
                                                                                                                                                    • Antonio Gomes, Columbia University, United States
                                                                                                                                                    • Harris Wang, Columbia University, United States

                                                                                                                                                    Area Session Chair: Bruno Gaeta

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    ChIP-seq enables genome-scale identification of regulatory regions that govern gene expression. However, the biological insights generated from ChIP-seq analysis have been limited to predictions of binding sites and cooperative interactions. Furthermore, ChIP-seq data often correlate poorly with in vitro measurements or predicted motifs, highlighting that binding affinity alone is insufficient to explain transcription factor (TF)-binding in vivo. One possibility is that binding sites are not equally accessible across the genome. A more comprehensive biophysical representation of TF-binding is required to improve our ability to understand, predict, and alter gene expression. Here, we show that genome accessibility is a key parameter that impacts TF-binding in bacteria. We developed a thermodynamic model that parameterizes ChIP-seq coverage in terms of genome accessibility and binding affinity. The role of genome accessibility is validated using a large-scale ChIP-seq dataset of the M. tuberculosis regulatory network. We find that accounting for genome accessibility led to a model that explains 63% of the ChIP-seq profile variance, while a model based on motif score alone explains only 35% of the variance. Moreover, our framework enables de novo ChIP-seq peak prediction and is useful for inferring TF-binding peaks in new experimental conditions, reducing the need for additional experiments. We observe that the genome is more accessible in intergenic regions, and that increased accessibility is positively correlated with gene expression and anti-correlated with distance to the origin of replication. Our biophysical model provides a more comprehensive description of TF-binding in vivo from first principles towards a better representation of gene regulation in silico, with promising applications in systems biology.

                                                                                                                                                    TP051 (PT) - A Network-driven Approach for Genome-wide Association Mapping
                                                                                                                                                    Date: Monday, July 11 12:00 pm - 12:20 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: GENES / DISEASE
                                                                                                                                                    • Seunghak Lee, Carnegie Mellon University, United States
                                                                                                                                                    • Soonho Kong, Carnegie Mellon University, United States
                                                                                                                                                    • Eric Xing, Carnegie Mellon University, United States

                                                                                                                                                    Area Session Chair: Nicola Mulder

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation:

                                                                                                                                                    It remains a challenge to detect associations between genotypes and phenotypes because of insufficient sample sizes and complex underlying mechanisms involved in associations. Fortunately, it is becoming more feasible to obtain gene expression data in addition to genotypes and phenotypes, giving us new opportunities to detect true genotype-phenotype associations while unveiling their association mechanisms.

                                                                                                                                                    Results:

                                                                                                                                                    In this paper, we propose a novel method, NETAM, that accurately detects associations between SNPs and phenotypes, as well as gene traits involved in such associations. We take a network-driven approach: NETAM first constructs an association network, where nodes represent SNPs, gene traits, or phenotypes, and edges represent the strength of association between two nodes. NETAM assigns a score to each path from an SNP to a phenotype, and then identifies significant paths based on the scores. In our simulation study, we show that NETAM finds significantly more phenotype-associated SNPs than traditional genotype-phenotype association analysis under false positive control, taking advantage of gene expression data. Furthermore, we applied NETAM to late-onset Alzheimer's disease data and identified 477 significant path associations, among which we analyzed paths related to beta-amyloid, estrogen, and nicotine pathways. We also provide hypothetical biological pathways to explain our findings.
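The idea of scoring paths through an association network can be sketched as follows. The abstract does not specify how NETAM scores a path or tests significance, so everything here is an assumption made for illustration: paths are restricted to SNP, gene trait, phenotype, the path score is taken as the product of the two edge strengths, and the example edge weights are invented.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]
Path = Tuple[str, str, str]


def score_two_hop_paths(
    snp_to_gene: Dict[Edge, float],
    gene_to_phenotype: Dict[Edge, float],
) -> List[Tuple[Path, float]]:
    """Enumerate SNP -> gene trait -> phenotype paths, ranked by score.

    The score is the product of the two edge strengths; this scoring rule is
    a placeholder assumption, not the published NETAM score.
    """
    paths = []
    for (snp, gene), first_strength in snp_to_gene.items():
        for (source_gene, phenotype), second_strength in gene_to_phenotype.items():
            if gene == source_gene:
                paths.append(((snp, gene, phenotype), first_strength * second_strength))
    return sorted(paths, key=lambda item: item[1], reverse=True)


# Hypothetical association strengths, for illustration only.
snp_to_gene = {("snp1", "geneA"): 0.9, ("snp2", "geneA"): 0.2, ("snp2", "geneB"): 0.5}
gene_to_phenotype = {("geneA", "trait"): 0.8, ("geneB", "trait"): 0.4}

ranked_paths = score_two_hop_paths(snp_to_gene, gene_to_phenotype)
```

This sketch only ranks paths; deciding which of them are significant would require a null distribution or a false positive control procedure, which is not shown.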

                                                                                                                                                    TP052 (PT) - Deciphering evolutionary strata on plant sex chromosomes and fungal mating-type chromosomes through compositional segmentation
                                                                                                                                                    Date: Monday, July 11 12:20 pm - 12:40 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES / SYSTEMS
                                                                                                                                                    • Rajeev Azad, University of North Texas, United States
                                                                                                                                                    • Ravi Shanker Pandey, University of North Texas, United States

                                                                                                                                                    Area Session Chair: Yana Bromberg

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Abstract:
                                                                                                                                                    Sex chromosomes have evolved from a pair of homologous autosomes which differentiated into sex determination systems, such as XY or ZW systems, as a consequence of successive recombination suppression between gametologous chromosomes. To identify regions of recombination suppression, the “evolutionary strata”, even when only the sequence of the sex chromosome in the homogametic sex (i.e. X or Z chromosome) is available, we have developed an integrated segmentation and clustering method. In order to understand the early evolution of sex chromosomes, we applied our method to recently evolved plant sex chromosomes. Our method could decipher all known evolutionary strata on papaya and Silene latifolia X chromosomes, and deciphered two previously unknown evolutionary strata on an incipient sex chromosome of Populus trichocarpa. Application to sex chromosome V of brown alga Ectocarpus sp. recovered sex determining and pseudoautosomal regions, and application to mating-type chromosomes of an anther-smut fungus Microbotryum lychnidis-dioicae uncovered five new strata.

                                                                                                                                                    Justification:
                                                                                                                                                    Evolution of sex chromosomes in animals and birds is relatively better studied than in plants, although 48 dioecious plants have already been reported. A key aspect in understanding sex chromosome evolution is to decipher the successive regions of recombination suppression between the gametologous sex chromosomes. However, until now, only two plants, Silene latifolia and papaya, have been examined for the recombination-suppressed regions, namely, the evolutionary strata, on their X chromosomes. This was made possible by sequencing of sex-linked genes on both X and Y chromosomes, which is a requirement of all current methods that determine strata structure based on comparison of gametologous sex chromosomes. To circumvent this limitation and detect strata even in the absence of Y chromosome sequence, we have developed an integrated segmentation and clustering method, which could recapitulate the previously identified strata on the Silene latifolia and papaya X chromosomes without X-Y comparison, and deciphered two previously unknown strata on an incipient sex chromosome of Populus trichocarpa.

                                                                                                                                                    Emergence and evolution of sex chromosomes in many plants are much more recent than the mammalian sex chromosome histories, and therefore our approach provides a much-needed tool for understanding early evolution of sex chromosomes using dioecious plants as model systems. The paucity of heterogametic sex chromosome sequence (Y or W sequence) makes our approach even more relevant, and perhaps the only available tool, for understanding sex chromosome evolution without being constrained by the unavailability of Y or W sequence, or by the loss of Y-linked or W-linked genes.

                                                                                                                                                    TP053 (PT) - Predicting effects of noncoding variants with deep learning-based sequence model
                                                                                                                                                    Date: Monday, July 11 12:20 pm - 12:40 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: GENES / DATA
                                                                                                                                                    • Jian Zhou, Princeton University, United States
                                                                                                                                                    • Olga Troyanskaya, Princeton University, United States

                                                                                                                                                    Area Session Chair: Bruno Gaeta

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning-based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/), that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants.
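Predicting a variant effect from sequence alone generally follows one pattern: score the reference sequence, score the same sequence carrying the alternate allele, and compare. The sketch below shows that pattern with a deliberately trivial stand-in scorer. It is not DeepSEA: `toy_gc_model` is a hypothetical placeholder for a trained network that predicts chromatin features, and the function names are assumptions for illustration.

```python
import numpy as np

BASES = "ACGT"


def one_hot_encode(sequence: str) -> np.ndarray:
    """Encode a DNA string as a (length x 4) matrix with columns A, C, G, T."""
    indices = [BASES.index(base) for base in sequence.upper()]
    encoded = np.zeros((len(indices), len(BASES)))
    encoded[np.arange(len(indices)), indices] = 1.0
    return encoded


def variant_effect(model, sequence: str, position: int, alt_base: str) -> float:
    """Difference in model output between alternate and reference sequences."""
    alt_sequence = sequence[:position] + alt_base + sequence[position + 1:]
    return model(one_hot_encode(alt_sequence)) - model(one_hot_encode(sequence))


def toy_gc_model(encoded: np.ndarray) -> float:
    """Hypothetical stand-in scorer: the fraction of G or C bases."""
    gc_weights = np.array([0.0, 1.0, 1.0, 0.0])
    return float((encoded @ gc_weights).mean())


# Substituting A with G at position 0 of "ACGT" raises the GC fraction.
effect = variant_effect(toy_gc_model, "ACGT", 0, "G")
```

Any callable that maps a one-hot matrix to a number can be passed as `model`, so a trained sequence model could replace the toy scorer without changing `variant_effect`.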

                                                                                                                                                    TP054 (PT) - Integrative genomics analyses unveil downstream biological effectors of disease-specific polymorphisms buried in intergenic regions
                                                                                                                                                    Date: Monday, July 11 12:20 pm - 12:40 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: GENES / DISEASE
                                                                                                                                                    • Haiquan Li, University of Arizona, United States
                                                                                                                                                    • Ikbel Achour, University of Arizona Center for Biomedical Informatics and Biostatistics, United States
                                                                                                                                                    • Lisa Bastarache, Vanderbilt University, United States
                                                                                                                                                    • Joanne Berghout, The University of Arizona, United States
                                                                                                                                                    • Vincent Gardeux, The University of Illinois at Chicago, France
                                                                                                                                                    • Jianrong Li, University of Arizona, United States
                                                                                                                                                    • Younghee Lee, University of Utah, United States
                                                                                                                                                    • Lorenzo Pesce, The University of Chicago, United States
                                                                                                                                                    • Xinan Yang, The University of Chicago, United States
                                                                                                                                                    • Kenneth Ramos, The University of Arizona, United States
                                                                                                                                                    • Ian Foster, Argonne National Laboratory & The University of Chicago, United States
                                                                                                                                                    • Joshua Denny, Vanderbilt University, United States
                                                                                                                                                    • Jason Moore, University of Pennsylvania, United States
                                                                                                                                                    • Yves Lussier, The University of Arizona, United States

                                                                                                                                                    Area Session Chair: Nicola Mulder

                                                                                                                                                     Presentation Overview:

                                                                                                                                                     Altered biological mechanisms arising from disease-associated polymorphisms remain difficult to characterize when those variants are intergenic. We developed a computational method that identifies shared downstream mechanisms by which inter- and intragenic SNPs contribute to a specific physiopathology. Modelling 2,000,000 pairs of disease-associated SNPs (GWAS) with eQTL and Gene Ontology functional annotations, we predicted 3,870 inter-inter and inter-intra SNP-pairs with convergent biological mechanisms (FDR<0.05). These SNP-pairs with overlapping mRNA targets or similar functional annotations were more associated with the same disease than with unrelated pathologies (OR>12). We independently confirmed synergistic and antagonistic genetic interactions for prioritized SNP-pairs of Alzheimer’s (p=0.046), cancer (p=0.039), and rheumatoid arthritis (p<10^-4). Using ENCODE, we validated that the biological mechanisms shared within prioritized SNP-pairs are frequently governed by matching transcription factor binding sites and long-range chromatin interactions. These results provide a “roadmap” of disease mechanisms emerging from GWAS and further identify downstream candidate therapeutic targets of intergenic SNPs.
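The two quantities this abstract relies on, overlapping mRNA targets and an odds ratio, can be illustrated with a minimal Python sketch. The SNP identifiers, gene names, and counts below are toy values invented for illustration, and the functions are hypothetical helpers, not the authors' pipeline.

```python
def shared_targets(eqtl_targets, snp_a, snp_b):
    """Return the mRNA targets that two SNPs have in common (toy eQTL map)."""
    return eqtl_targets.get(snp_a, set()) & eqtl_targets.get(snp_b, set())


def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table.

    a: related pairs, same disease      b: related pairs, different disease
    c: unrelated pairs, same disease    d: unrelated pairs, different disease
    """
    return (a * d) / (b * c)


# Toy eQTL map: SNP -> set of genes whose expression it is associated with.
eqtl_targets = {
    "snp_intergenic": {"geneA", "geneB"},
    "snp_intragenic": {"geneB", "geneC"},
}
overlap = shared_targets(eqtl_targets, "snp_intergenic", "snp_intragenic")
toy_or = odds_ratio(6, 4, 2, 8)
```

An odds ratio above 1 in such a table means that pairs with shared targets are found for the same disease more often than pairs without them.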

                                                                                                                                                    TP055 (PT) - DeepMeSH: Deep Semantic Representation for Improving Large-scale MeSH Indexing
                                                                                                                                                    Date: Monday, July 11 2:00 pm - 2:20 pm
                                                                                                                                                    Room: Northern Hemisphere BCD
                                                                                                                                                    Theme: DATA
                                                                                                                                                    • Shengwen Peng, Fudan University, China
                                                                                                                                                    • Ronghui You, Fudan University, China
                                                                                                                                                    • Hongning Wang, Department of Computer Science at University of Virginia, United States
                                                                                                                                                    • Chengxiang Zhai, UIUC, United States
                                                                                                                                                    • Hiroshi Mamitsuka, Kyoto University, Japan
                                                                                                                                                    • Shanfeng Zhu, Fudan University, China

                                                                                                                                                    Area Session Chair: Russell Schwartz

                                                                                                                                                     Presentation Overview:

                                                                                                                                                    Motivation:
                                                                                                                                                    Medical Subject Headings (MeSH) indexing, which is to assign a
                                                                                                                                                    set of MeSH main headings to citations, is crucial for many
                                                                                                                                                    important tasks in biomedical text mining and information retrieval.
                                                                                                                                                    Large-scale MeSH indexing has two challenging aspects: the citation side and
                                                                                                                                                    MeSH side.
                                                                                                                                                    For the citation side, all existing methods, including Medical Text
                                                                                                                                                    Indexer (MTI) by NLM (National Library of Medicine) and the
                                                                                                                                                    state-of-the-art method, MeSHLabeler, deal with text by bag-of-words,
                                                                                                                                                    which cannot capture semantic and context-dependent information well.

                                                                                                                                                     Methods: We propose DeepMeSH, which incorporates deep semantic
                                                                                                                                                     information for large-scale MeSH indexing.
                                                                                                                                                     It addresses the challenges on both the citation side and the MeSH side.
                                                                                                                                                    The citation side challenge is solved by a new deep semantic representation,
                                                                                                                                                    D2V-TFIDF, which concatenates both sparse and dense semantic representations.
                                                                                                                                                     The MeSH side challenge is solved by using the "learning to rank" framework of
                                                                                                                                                    MeSHLabeler, which integrates various types of evidence generated from
                                                                                                                                                    the new semantic representation.
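The idea of concatenating a sparse and a dense representation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dense vector is a hand-written placeholder standing in for a learned document embedding, the unit-length scaling of each part is an assumption, and `tfidf_vectors`, `unit`, and `d2v_tfidf` are hypothetical names rather than DeepMeSH code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def unit(vec):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def d2v_tfidf(sparse_vec, dense_vec):
    """Concatenate the normalized sparse and dense parts into one feature vector."""
    return unit(sparse_vec) + unit(dense_vec)


docs = [["gene", "expression", "cancer"], ["gene", "mesh", "indexing"]]
vocab, vectors = tfidf_vectors(docs)
dense_placeholder = [0.2, -0.1, 0.4]  # stand-in for a learned document embedding
combined = d2v_tfidf(vectors[0], dense_placeholder)
```

The combined vector has one block of vocabulary-sized TF-IDF features followed by one block of embedding features, so a downstream classifier or ranker sees both kinds of evidence at once.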

                                                                                                                                                    Results:
                                                                                                                                                     DeepMeSH achieved a Micro F-measure of 0.6323, 2% higher than 0.6218
                                                                                                                                                     of MeSHLabeler and 12% higher than 0.5637 of MTI, for BioASQ3 challenge
                                                                                                                                                    data with 6,000 citations.
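For readers unfamiliar with the metric, the following Python sketch shows the standard micro-averaged F-measure (counts pooled over all citations) and recomputes the relative gains from the three scores quoted above. The gold and predicted heading sets are toy values, and `micro_f1` and `relative_gain` are hypothetical helper names.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F-measure: pool TP/FP/FN over all citations, then score."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Toy example: two citations with gold and predicted MeSH headings.
gold = [{"Humans", "Neoplasms"}, {"Mice"}]
predicted = [{"Humans"}, {"Mice", "Animals"}]
toy_score = micro_f1(gold, predicted)

# Relative gains implied by the reported scores.
gain_vs_meshlabeler = relative_gain(0.6323, 0.6218)
gain_vs_mti = relative_gain(0.6323, 0.5637)
```

With the reported scores, the gains come out at roughly 1.7% and 12.2%, consistent with the rounded 2% and 12% in the abstract.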

                                                                                                                                                    TP056 (PT) - Alignment-free scaffolding of large genome drafts using long sequences and jumping library MPET reads
                                                                                                                                                    Date: Monday, July 11 2:00 pm - 2:20 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Rene Warren, BC Cancer Agency, Genome Sciences Centre, Canada
                                                                                                                                                    • Lauren Coombe, BC Cancer Agency, Genome Sciences Centre, Canada
                                                                                                                                                    • Sarah Yeo, BC Cancer Agency, Genome Sciences Centre, Canada
                                                                                                                                                    • Chen Yang, BC Cancer Agency, Genome Sciences Centre, Canada
                                                                                                                                                    • Justin Chu, BC Cancer Agency, Genome Sciences Centre, Canada
                                                                                                                                                    • Austin Hammond, BC Cancer Agency, Genome Sciences Centre, Canada
                                                                                                                                                    • Hamid Mohamadi, BC Cancer Agency, Genome Sciences Centre, Canada
                                                                                                                                                    • Ben Vandervalk, BC Cancer Agency, Genome Sciences Centre, Canada
                                                                                                                                                    • Erdi Kucuk, BC Cancer Agency, Genome Sciences Centre, Canada
                                                                                                                                                    • Inanc Birol, BC Cancer Agency, Genome Sciences Centre, Canada

                                                                                                                                                    Area Session Chair: Pedja Radivojac

                                                                                                                                                     Presentation Overview:

                                                                                                                                                     Description:

                                                                                                                                                     Over the past months, single-molecule long reads from established and emerging technologies have proven valuable for assembling complete bacterial draft genomes and for helping to track viral outbreaks. At the moment, using those technologies on their own is still often too costly for de novo assembly of mammalian-size genomes. Last year, we demonstrated that, despite the lower base accuracy associated with long-read sequencing platforms, long reads are indisputably effective for scaffolding small and large high-quality draft genomes: scaffolding increases the contiguity and completeness of low-cost assemblies and thereby reduces the complexity of genome drafts. During the course of the year, a new read-linking technology from 10X Genomics has emerged and holds promise for genome scaffolding. We will present advances in scaffolding and genome finishing, describing further developments to the LINKS scaffolder and how we applied these technologies to the large genomes of American bullfrog and spruce.

                                                                                                                                                     Justification:

                                                                                                                                                    We submit the enclosed manuscript entitled “LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads” for consideration as a presentation for the highlights track of ISMB.
                                                                                                                                                     Long sequence reads are of prime importance to genome assembly, which is in turn a cornerstone of genome characterization. Although long reads from existing and upcoming technologies still have some way to go before being used routinely in de novo genome assembly projects, their utility for scaffolding existing good-quality assemblies is paramount. The scaffolding problem has been explored by many, including our group, but scaffolding methods have only recently been applied to emerging long DNA sequence reads from Oxford Nanopore Technologies (ONT) Ltd.
                                                                                                                                                     In our presentation we discuss an effective and elegant method for genome scaffolding with long and imperfect sequences that uses linked k-mers at set distance intervals. We present new developments since publication, including native scaffolding with jumping library (MPET) reads and the use of an improved Bloom filter to exclude erroneous k-mer pairs. We demonstrate that even low-accuracy sequence data have tremendous potential for increasing genome assembly contiguity without the need for error correction or pre-processing, and show how our alignment-free solution scales up to large eukaryotic genomes.
                                                                                                                                                    We anticipate that this timely work will be of broad interest to ISMB attendees as the uptake of genomics in research labs and in the clinic increases with the affordability of DNA sequencing. We expect LINKS to have utility in helping assemble large genomes, as we enter the era of long DNA sequence reads.
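The notion of linked k-mers at set distance intervals can be made concrete with a short Python sketch. It is a simplified illustration, not the LINKS implementation: the parameter names `k`, `distance`, and `step`, the exact-match dictionary index (in place of a Bloom filter), and the toy read and contigs are all assumptions made here.

```python
from collections import Counter


def kmer_pairs(read, k, distance, step=1):
    """Extract k-mer pairs from a long read.

    Each pair spans `distance` bases, measured from the start of the first
    k-mer to the end of the second; windows advance by `step` bases.
    """
    if distance < 2 * k:
        raise ValueError("distance must be at least 2 * k")
    pairs = []
    for i in range(0, len(read) - distance + 1, step):
        first = read[i:i + k]
        second = read[i + distance - k:i + distance]
        pairs.append((first, second, i))
    return pairs


def count_links(pairs, contigs, k):
    """Count k-mer pairs whose two k-mers map uniquely to different contigs."""
    index = {}
    for name, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    links = Counter()
    for first, second, _ in pairs:
        a = index.get(first, set())
        b = index.get(second, set())
        # Keep only unambiguous pairs that connect two distinct contigs.
        if len(a) == 1 and len(b) == 1 and a != b:
            links[(next(iter(a)), next(iter(b)))] += 1
    return links


read = "ACGTACGTAGCTAGCTAGGA"  # toy long read
contigs = {"c1": "TTACGTAC", "c2": "GGAGCTAG"}  # toy draft contigs
pairs = kmer_pairs(read, k=4, distance=12, step=4)
links = count_links(pairs, contigs, k=4)
```

Because only exact k-mer lookups are needed, no read-to-contig alignment is performed; contig pairs supported by many k-mer pairs become candidates for joining into a scaffold.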

                                                                                                                                                    TP057 (PT) - A Cross-Species Bi-Clustering Approach to Identifying Conserved Co-regulated Genes
                                                                                                                                                    Date: Monday, July 11 2:00 pm - 2:20 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: GENES / SYSTEMS
                                                                                                                                                    • Jiangwen Sun, University of Connecticut, United States
                                                                                                                                                    • Zongliang Jiang, University of Connecticut, United States
                                                                                                                                                    • X Cindy Tian, University of Connecticut, United States
                                                                                                                                                    • Jinbo Bi, University of Connecticut, United States

                                                                                                                                                    Area Session Chair: Reinhard Schneider

                                                                                                                                                     Presentation Overview:

                                                                                                                                                     Motivation: A growing number of studies have explored the process of pre-implantation embryonic development of multiple mammalian species. However, the conservation and variation among different species in their developmental programming are poorly defined due to the lack of effective computational methods for detecting co-regulated genes that are conserved across species. The most sophisticated method to date for identifying conserved co-regulated genes is a two-step approach. This approach first identifies gene clusters for each species by a cluster analysis of gene expression data, and subsequently computes the overlaps of clusters identified from different species to reveal common subgroups. This approach is ineffective at dealing with the noise in the expression data introduced by the complicated procedures for quantifying gene expression. Furthermore, due to the sequential nature of the approach, the gene clusters identified in the first step may have little overlap among different species in the second step, making it difficult to detect conserved co-regulated genes.

                                                                                                                                                     Results: We propose a cross-species bi-clustering approach which first denoises the gene expression data of each species into a data matrix. The rows of the data matrices of different species represent the same set of genes, which are characterized by their expression patterns over the developmental stages of each species as columns. A novel bi-clustering method is then developed to cluster genes into subgroups by a joint sparse rank-one factorization of all the data matrices. This method decomposes each data matrix into a product of a column vector and a row vector, where the column vector is a consistent indicator across the matrices (species) that identifies the same gene cluster, and the row vector specifies, for each species, the developmental stages at which the clustered genes are co-regulated. An efficient optimization algorithm has been developed, together with a convergence analysis. This approach was first validated on synthetic data and compared to the two-step method and several recent joint clustering methods. We then applied this approach to two real-world datasets of gene expression during the pre-implantation embryonic development of human and mouse. Co-regulated genes consistent between human and mouse were identified, offering insights into conserved functions, as well as similarities and differences in genome activation timing between human and mouse embryos.

                                                                                                                                                    Availability: The R package containing the implementation of the proposed method in C++ is available at: https://github.com/JavonSun/mvbc.git and also at the R platform https://www.r-project.org/.
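To make the joint sparse rank-one factorization more tangible, here is a minimal Python sketch of one way to fit a shared gene indicator and per-species stage vectors. It swaps in a simple alternating least-squares scheme with soft-thresholding and renormalization; this is an assumption made for illustration and is not the optimization algorithm of the paper or the API of the mvbc package. All names and the toy matrices are hypothetical.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def project(matrices, u):
    """For each species matrix X (genes x stages), return v = X^T u."""
    return [[sum(X[g][j] * u[g] for g in range(len(u))) for j in range(len(X[0]))]
            for X in matrices]


def joint_rank_one(matrices, lam=0.1, iterations=50):
    """Approximate every X_s by u v_s^T with one sparse gene vector u shared by all."""
    n_genes = len(matrices[0])
    u = [1.0 / n_genes ** 0.5] * n_genes  # unit-norm starting point
    for _ in range(iterations):
        vs = project(matrices, u)
        denom = sum(x * x for v in vs for x in v)
        if denom == 0:
            break
        # Least-squares update of u given all v_s, then sparsify.
        raw = [sum(sum(X[g][j] * v[j] for j in range(len(v)))
                   for X, v in zip(matrices, vs)) / denom
               for g in range(n_genes)]
        shrunk = [soft_threshold(x, lam) for x in raw]
        norm = sum(x * x for x in shrunk) ** 0.5
        if norm == 0:
            u = shrunk
            break
        u = [x / norm for x in shrunk]
    return u, project(matrices, u)


# Toy data: 4 genes; species A has 3 stages, species B has 2 stages.
species_a = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0], [0.1, 0.0, 0.1], [0.0, 0.0, 0.0]]
species_b = [[0.0, 3.0], [0.0, 3.0], [0.0, 0.0], [0.0, 0.0]]
u, stage_vectors = joint_rank_one([species_a, species_b])
cluster = [g for g, weight in enumerate(u) if weight != 0.0]
```

In this toy case the nonzero entries of `u` pick out genes 0 and 1 in both species, while each entry of `stage_vectors` shows the stages in which that cluster is active for one species.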

                                                                                                                                                    TP058 (PT) - Candidate gene prioritization with Endeavour
                                                                                                                                                    Date: Monday, July 11 2:00 pm - 2:20 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DISEASE / DATA
                                                                                                                                                     • Léon-Charles Tranchevent, Laboratoire de Biologie et de Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, Université de Lyon, France
                                                                                                                                                    • Amin Ardeshirdavani, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
                                                                                                                                                    • Sarah Elshal, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
                                                                                                                                                    • Daniel Alcaide, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
                                                                                                                                                    • Jan Aerts, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
                                                                                                                                                     • Didier Auboeuf, Laboratoire de Biologie et de Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, Université de Lyon, France
                                                                                                                                                    • Yves Moreau, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium

                                                                                                                                                    Area Session Chair: Judith Blake

                                                                                                                                                     Presentation Overview:

                                                                                                                                                     Genomic studies and high-throughput experiments often produce large lists of candidate genes among which only a few are truly relevant to the disease, phenotype, or biological process of interest. Gene prioritization tackles this problem by profiling candidate genes across multiple genomic data sources and integrating this heterogeneous information into a global ranking. We describe an extended version of our gene prioritization method, Endeavour, now available for 6 species and integrating 75 data sources. Validation of our results indicates that this extended version of Endeavour efficiently prioritizes candidate genes. The Endeavour web server is freely available at https://endeavour.esat.kuleuven.be/
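The step of integrating per-source rankings into a global ranking can be sketched in Python as follows. The fusion rule used here, a plain average of rank ratios over the sources that cover a gene, is a deliberately simple stand-in chosen for illustration and should not be read as Endeavour's own fusion method; the source names, gene labels, and scores are toy values.

```python
def rank_ratios(scores):
    """Map each gene to rank / number of genes within one data source.

    `scores` maps gene -> score, where a higher score means a better candidate.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {gene: (i + 1) / len(ordered) for i, gene in enumerate(ordered)}


def global_ranking(sources):
    """Fuse per-source rankings by averaging rank ratios (lower is better).

    A gene missing from a source is simply skipped for that source.
    """
    ratios = [rank_ratios(scores) for scores in sources.values()]
    genes = {gene for r in ratios for gene in r}
    fused = {}
    for gene in genes:
        values = [r[gene] for r in ratios if gene in r]
        fused[gene] = sum(values) / len(values)
    # Sort by fused score, breaking ties by gene name for determinism.
    return sorted(fused, key=lambda gene: (fused[gene], gene))


sources = {
    "expression": {"A": 0.9, "B": 0.2, "C": 0.5},
    "interactions": {"A": 0.7, "B": 0.8, "C": 0.1},
    "literature": {"A": 0.6, "C": 0.3},  # gene B has no literature data
}
ranking = global_ranking(sources)
```

Using rank ratios rather than raw scores puts heterogeneous sources on a common scale and lets a gene with missing data in one source still be ranked from the others.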

                                                                                                                                                    TP059 (PT) - Translation of Genotype to Phenotype by a Hierarchy of Cell Subsystems
                                                                                                                                                    Date: Monday, July 11 2:20 pm - 2:40 pm
                                                                                                                                                    Room: Northern Hemisphere BCD
                                                                                                                                                    Theme: DATA / SYSTEMS
                                                                                                                                                    • Michael Ku Yu, UCSD, United States
                                                                                                                                                    • Michael Kramer, UCSD, United States
                                                                                                                                                    • Janusz Dutkowski, UCSD, Data4Cure, United States
                                                                                                                                                    • Rohith Srivas, UCSD, Stanford University, United States
                                                                                                                                                    • Katherine Licon, UCSD, United States
                                                                                                                                                    • Jason F. Kreisberg, UCSD, United States
                                                                                                                                                    • Cherie Ng, aTyr Pharmaceuticals, United States
                                                                                                                                                    • Nevan Krogan, UCSF, United States
                                                                                                                                                     • Roded Sharan, Tel Aviv University, Israel
                                                                                                                                                    • Trey Ideker, UCSD, United States

                                                                                                                                                    Area Session Chair: Russell Schwartz

                                                                                                                                                     Presentation Overview:

                                                                                                                                                    Accurately translating genotype to phenotype requires accounting for the functional impact of genetic variation at many biological scales. Here, we present a strategy for genotype-phenotype reasoning based on existing knowledge of cellular subsystems. These subsystems and their hierarchical organization are defined by the Gene Ontology or a complementary ontology inferred directly from previously published datasets. Guided by the ontology’s hierarchical structure, we organize genotype data into an “ontotype,” that is, a hierarchy of perturbations representing the effects of genetic variation at multiple cellular scales. The ontotype is then interpreted using logical rules generated by machine learning to predict phenotype. This approach substantially outperforms previous non-hierarchical methods for translating yeast genotype to cell growth phenotype, and it accurately predicts the growth outcomes of two new screens of 2,503 double gene knockouts affecting DNA repair or nuclear lumen. Ontotypes also generalize to larger knockout combinations, setting the stage for interpreting the complex genetics of disease.
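As a rough illustration of how a genotype might be organized into an ontotype, the Python sketch below propagates a set of knocked-out genes up a small term hierarchy and records, for each term, how many of its genes are perturbed. The hierarchy, term names, and gene labels are toy values, and the count-per-term encoding is an assumption made here; the machine-learning step that maps the ontotype to a phenotype is not shown.

```python
def genes_under(term, children, direct_genes, memo):
    """Return all genes annotated to a term or to any of its descendants."""
    if term not in memo:
        found = set(direct_genes.get(term, ()))
        for child in children.get(term, ()):
            found |= genes_under(child, children, direct_genes, memo)
        memo[term] = found
    return memo[term]


def ontotype(knockouts, children, direct_genes):
    """Map each term to the number of knocked-out genes it contains."""
    knocked = set(knockouts)
    terms = set(children) | set(direct_genes)
    terms |= {child for kids in children.values() for child in kids}
    memo = {}
    return {term: len(genes_under(term, children, direct_genes, memo) & knocked)
            for term in sorted(terms)}


# Toy hierarchy: root -> {repair, lumen}; repair -> {ber, mmr}.
children = {"root": ["repair", "lumen"], "repair": ["ber", "mmr"]}
direct_genes = {"ber": {"g1", "g2"}, "mmr": {"g3"}, "lumen": {"g4", "g1"}}

double_knockout = {"g1", "g3"}
profile = ontotype(double_knockout, children, direct_genes)
```

Here the two knockouts hit different child terms, so each child shows one perturbation while their shared parent shows two; a learner working on such features can pick up effects that only appear at the level of a larger subsystem.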

                                                                                                                                                    TP060 (PT) - Genome assembly from synthetic long read clouds
                                                                                                                                                    Date: Monday, July 11 2:20 pm - 2:40 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Volodymyr Kuleshov, Stanford University, United States
                                                                                                                                                    • Michael Snyder, Stanford University, United States
                                                                                                                                                    • Serafim Batzoglou, Stanford University, United States

                                                                                                                                                    Area Session Chair: Pedja Radivojac

                                                                                                                                                     Presentation Overview:

                                                                                                                                                    Motivation: Despite rapid progress in sequencing technology, assembling de-novo the genomes of new species as well as reconstructing complex metagenomes remain major technological challenges. New synthetic long read (SLR) technologies promise significant advances towards these goals; however, their applicability is limited by high sequencing requirements and the inability of current assembly paradigms to cope with combinations of short and long reads.
                                                                                                                                                    Results: Here, we introduce Architect, a new de-novo scaffolder aimed at synthetic long read technologies. Unlike previous assembly strategies, Architect does not require a costly subassembly step; instead it assembles genomes directly from the SLR’s underlying short reads, which we refer to as read clouds. This enables a 4- to 20-fold reduction in sequencing requirements and a five-fold increase in assembly contiguity on both genomic and metagenomic datasets relative to state-of-the-art assembly strategies aimed directly at fully-subassembled long reads.

                                                                                                                                                    TP061 (PT) - Structure-Based Prediction of Transcription Factor Binding Specificity using an Integrative Energy Function
                                                                                                                                                    Date: Monday, July 11 2:20 pm - 2:40 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: PROTEINS
                                                                                                                                                    • Alvin Farrel, University of North Carolina at Charlotte, United States
                                                                                                                                                    • Jonathan Murphy, University of North Carolina at Charlotte, United States
                                                                                                                                                    • Jun-Tao Guo, University of North Carolina at Charlotte, United States

                                                                                                                                                    Area Session Chair: Reinhard Schneider

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Transcription factors (TFs) regulate gene expression through binding to specific target DNA sites. Accurate annotation of transcription factor binding sites (TFBSs) at genome scale represents an essential step toward our understanding of gene regulation networks. In this paper, we present a structure-based method for computational prediction of TFBSs using a novel, integrative energy function. The new energy function combines a multibody knowledge-based potential and two atomic energy terms (hydrogen bond and π-interaction) that might not be accurately captured by the knowledge-based potential due to the mean force nature and low count problem. We applied the new energy function to the TFBS prediction using a non-redundant dataset that consists of transcription factors from 12 different families. Our results show that the new integrative energy function improves the prediction accuracy over the knowledge-based, statistical potentials, especially for homeodomain transcription factors, the second largest TF family in mammals.

                                                                                                                                                    TP062 (PT) - Furthering understanding of human diseases through integrative cross-species analysis
                                                                                                                                                    Date: Monday, July 11 2:20 pm - 2:40 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DISEASE / DATA
                                                                                                                                                    • Victoria Yao, Princeton University, United States
                                                                                                                                                    • Rachel Kaletsky, Princeton University, United States
                                                                                                                                                    • Coleen Murphy, Princeton University, United States
                                                                                                                                                    • Olga Troyanskaya, Princeton University, United States

                                                                                                                                                    Area Session Chair: Judith Blake

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    The etiology of complex human diseases is challenging to study, as they are likely a combination of many environmental and genetic factors. Elucidating the molecular basis of pathophysiologies of such diseases requires a combination of systems-level analyses in human and experimental investigations in model organisms. To fully leverage model systems to study human disease, we propose a framework that can combine human quantitative genetics results and computational models of model organism tissue biology to drive experimental screens for disruption of disease-relevant processes and identify candidate disease genes. Specifically, we develop a novel semi-supervised regularized Bayesian integration method to integrate a large compendium of heterogeneous datasets, primarily composed of publicly available expression datasets in model organism C. elegans. Using this method, we construct 203 tissue- and cell-type specific networks, and we demonstrate the accuracy of these networks in capturing tissue-specific functional signal, even for very small tissues and specific cell types. Combining these model organism functional maps with human quantitative genetics signal, we make disease gene predictions for 10 different diseases based on GWAS studies. Focusing on Parkinson’s disease, we further experimentally screen 45 of the top Parkinson's disease predictions for age-related motility defects. Analysis of 13,255 worms across 1,823 videos identifies significant age-related Parkinson's endophenotypes. Genes that correspond to strong phenotypes are prime candidates for further inquiry in human and could eventually be pursued as potential therapeutic targets.

                                                                                                                                                    TP063 (PT) - Jumping across biomedical contexts using compressive data fusion
                                                                                                                                                    Date: Monday, July 11 2:40 pm - 3:00 pm
                                                                                                                                                    Room: Northern Hemisphere BCD
                                                                                                                                                    Theme: DATA / DISEASE
                                                                                                                                                    • Marinka Zitnik, Stanford University, United States
                                                                                                                                                    • Blaz Zupan, University of Ljubljana, Slovenia

                                                                                                                                                    Area Session Chair: Russell Schwartz

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Motivation:
                                                                                                                                                    The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects, such as a gene and a disease, can be related in different ways, for example, directly via gene-disease associations or indirectly via functional annotations, chemicals and pathways. Different ways of relating these objects carry different semantic meanings. However, traditional methods disregard these semantics and thus cannot fully exploit their value in data modeling.

                                                                                                                                                    Results:
                                                                                                                                                    We present Medusa, an approach to detect size-k modules of objects that, taken together, appear most significant to another set of objects. Medusa operates on large-scale collections of heterogeneous data sets and explicitly distinguishes between diverse data semantics. It advances research along two dimensions: it builds on collective matrix factorization to derive different semantics, and it formulates the growing of the modules as a submodular optimization program. Medusa is flexible in choosing or combining semantic meanings and provides theoretical guarantees about detection quality. In a systematic study on 310 complex diseases, we show the effectiveness of Medusa in associating genes with diseases and detecting disease modules. We demonstrate that in predicting gene-disease associations Medusa compares favorably to methods that ignore diverse semantic meanings. We find that the utility of different semantics depends on disease categories and that, overall, Medusa recovers disease modules more accurately when combining different semantics.
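The abstract above frames module growing as a submodular optimization program with guarantees on detection quality. As a rough illustration of that general idea only, the sketch below runs the textbook greedy algorithm for a monotone submodular objective under a size-k constraint, which is known to reach at least a (1 - 1/e) fraction of the optimum. The coverage objective, the function name `greedy_module` and the toy gene and disease labels are all assumptions made for this example; they are not Medusa's actual objective, data or API.

```python
from typing import Dict, List, Set


def greedy_module(candidates: Dict[str, Set[str]], k: int) -> List[str]:
    """Greedily pick up to k candidates maximizing the number of covered targets.

    Coverage is a monotone submodular function, so this greedy choice is
    within a factor (1 - 1/e) of the best size-k selection.
    """
    chosen: List[str] = []
    covered: Set[str] = set()
    for _ in range(k):
        best_name, best_gain = None, 0
        # Sorted iteration makes tie-breaking deterministic.
        for name in sorted(candidates):
            if name in chosen:
                continue
            gain = len(candidates[name] - covered)  # marginal gain
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # nothing adds new coverage
            break
        chosen.append(best_name)
        covered |= candidates[best_name]
    return chosen


# Hypothetical toy data: each gene "covers" a set of diseases.
toy = {
    "g1": {"d1", "d2", "d3"},
    "g2": {"d3", "d4"},
    "g3": {"d4", "d5"},
    "g4": {"d1"},
}
module = greedy_module(toy, k=2)
```

On this toy input the first pick is `g1` (three new diseases) and the second is `g3` (two new diseases), since `g2` would add only one.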

                                                                                                                                                    TP064 (PT) - Multi-Genome Scaffold Co-Assembly Based on the Analysis of Gene Orders and Genomic Repeats
                                                                                                                                                    Date: Monday, July 11 2:40 pm - 3:00 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Sergey Aganezov, Computational Biology Institute & Department of Mathematics, The George Washington University, United States
                                                                                                                                                    • Max Alekseyev, George Washington University, United States

                                                                                                                                                    Area Session Chair: Pedja Radivojac

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Advances in DNA sequencing technology over the past decades have increased the volume of raw sequenced genomic data available for further assembly and analysis. While there exist many software tools for assembly of sequenced genomic material, they often experience difficulties with reconstructing complete chromosomes. Major obstacles include uneven read coverage and the presence of long similar DNA subsequences (repeats). Genome assemblers are therefore often able to reliably reconstruct only long fragments, called scaffolds. We present a method for simultaneous co-assembly of all fragmented genomes (represented as collections of scaffolds rather than chromosomes) in a given set of annotated genomes. The method is based on the analysis of gene orders and relies on an evolutionary model that includes genome rearrangements as well as gene insertions and deletions. It can also utilize information about genomic repeats and the phylogenetic tree of the given genomes, further improving their assembly quality.

                                                                                                                                                    TP065 (PT) - Most of the tight positional conservation of transcription factor binding sites near the transcription start site is due to their co-localization within regulatory modules
                                                                                                                                                        Cancelled
                                                                                                                                                    Date: Monday, July 11 2:40 pm - 3:00 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: GENES / PROTEINS
                                                                                                                                                    • Natalia Acevedo-Luna, Iowa State University, United States
                                                                                                                                                    • Leonardo Mariño-Ramírez, NIH, United States
                                                                                                                                                    • Armand Halbert, NIH, United States
                                                                                                                                                    • Ulla Hansen, Boston University, United States
                                                                                                                                                    • David Landsman, NIH, United States
                                                                                                                                                    • John Spouge, NIH, United States

                                                                                                                                                    Area Session Chair: Reinhard Schneider

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Transcription factors (TFs) form complexes that bind regulatory modules (RMs) within DNA. Consider a “Subunit Hypothesis”: sometimes, different TF complexes contain inexact copies of a subunit that coordinates the regulation of specific genes. Then, within the RMs for the genes, transcription factor binding sites should display tightly consistent positions relative to each other, and possibly, consistent positions relative to the transcription start site (TSS), too. Our statistics found 43 significant sets of TF motifs with positional preferences relative to the TSS, with 38 preferences tight (±5 bp). Each set of motifs corresponded to a “gene group” of 135 to 3304 genes, some groups independently validated with FDR < 10⁻⁴. The Subunit Hypothesis also implies that motifs corresponding to two TFs in a subunit should co-occur more than by chance alone, “enriching” the intersection of the gene groups corresponding to the two TFs. Of the 43 significant gene groups, we found 779 pairs of gene groups with significantly enriched intersections, many independently validated. A user-friendly web site at http://go.usa.gov/3kjsH permits experimental biologists to explore the interaction network of our TFBSs to identify candidate subunit RMs. Gene duplication and convergent evolution within a genome provide obvious biological mechanisms for replicating an RM that binds a particular TF subunit.

                                                                                                                                                    TP066 (PT) - SynLethDB: synthetic lethality database toward discovery of selective and sensitive anticancer drug targets
                                                                                                                                                    Date: Monday, July 11 2:40 pm - 3:00 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DISEASE / SYSTEMS
                                                                                                                                                    • Jing Guo, School of Computer Engineering, Nanyang Technological University, Singapore
                                                                                                                                                    • Hui Liu, Changzhou University, China
                                                                                                                                                    • Jie Zheng, School of Computer Engineering, Nanyang Technological University, Singapore

                                                                                                                                                    Area Session Chair: Judith Blake

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    250-word Scientific Justification

                                                                                                                                                    Synthetic lethality (SL) is a type of genetic interaction between two genes such that simultaneous perturbations of the two genes result in cell death, while a perturbation of either gene alone is not lethal. Hence, the inhibition of SL partners of genes with cancer-specific mutations could selectively kill cancer cells but spare normal cells. Therefore, SL is emerging as a promising anticancer strategy that could potentially overcome the drawbacks of traditional chemotherapies by reducing severe side effects. However, there has not been a comprehensive database dedicated to collecting SL pairs and related knowledge. In this paper, we propose a comprehensive database, SynLethDB (http://histone.sce.ntu.edu.sg/SynLethDB/), which contains SL pairs collected from biochemical assays, computational predictions and text mining results for human and four model species, i.e. mouse, fruit fly, worm and yeast. For each SL pair, a confidence score was calculated by integrating individual scores derived from different evidence sources. We also developed a statistical analysis module to estimate the sensitivity of cancer cells to drugs targeting human SL partners, based on large-scale genomics data, gene expression profiles and drug sensitivity profiles on more than 1000 cancer cell lines. To help users access, mine and interpret the wealth of data, we have implemented functionalities such as search and filtering, orthology search and gene set enrichment analysis, as well as a user-friendly web interface. SynLethDB would be a useful resource for the biomedical research community and the pharmaceutical industry.

                                                                                                                                                    150-word Presentation Description

                                                                                                                                                    Synthetic lethality (SL) is a type of genetic interaction between two genes such that simultaneous perturbations of the two genes result in cell death, while a perturbation of either gene alone is not lethal. Hence, the inhibition of SL partners of genes with cancer-specific mutations could selectively kill cancer cells but spare normal cells. Therefore, SL is an emerging anticancer strategy that could potentially overcome the drawbacks of traditional chemotherapies by reducing severe side effects. However, there has not been a comprehensive database dedicated to collecting SL pairs and related knowledge. In this talk, I will present the SynLethDB database (http://histone.sce.ntu.edu.sg/SynLethDB/), which contains SL pairs collected from chemical assays and computational predictions on human and model species. I will introduce the computational problem of SL prediction, with SynLethDB as benchmark data. Biologists can use the knowledge and data resources to guide wet-lab screenings of SL using the newest technologies (e.g. CRISPR-Cas9).

                                                                                                                                                    Source of Original Publication:
                                                                                                                                                    Jing Guo, Hui Liu, Jie Zheng. SynLethDB: synthetic lethality database toward discovery of selective and sensitive anticancer drug targets. Nucleic Acids Research, 44 (D1): D1011 – D1017, 2016 (Impact Factor = 9.112).

                                                                                                                                                    TP067 (PT) - CellCODE: a robust latent variable approach to differential expression analysis for heterogeneous cell populations
                                                                                                                                                    Date: Monday, July 11 3:30 pm - 3:50 pm
                                                                                                                                                    Room: Northern Hemisphere BCD
                                                                                                                                                    Theme: DATA / DISEASE
                                                                                                                                                    • Maria Chikina, University of Pittsburgh, United States
                                                                                                                                                    • Stuart Sealfon, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Elena Zaslavsky, Icahn School of Medicine at Mount Sinai, United States

                                                                                                                                                    Area Session Chair: Russell Schwartz

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Identifying alterations in gene expression associated with different clinical states is important for the study of human biology. However, clinical samples used in gene expression studies are often derived from heterogeneous mixtures with variable cell-type composition, complicating statistical analysis.

                                                                                                                                                    Considerable effort has been devoted to modeling sample heterogeneity, and presently there are many methods that can estimate cell proportions or pure cell-type expression from mixture data. However, there is no method that comprehensively addresses mixture analysis in the context of differential expression without relying on additional proportion information, which can be inaccurate and is frequently unavailable.

                                                                                                                                                    In this study we consider a clinically relevant situation where neither accurate proportion estimates nor pure cell expression is of direct interest, but where we are rather interested in detecting and interpreting relevant differential expression in mixture samples. We develop a method, cell-type COmputational Differential Estimation (CellCODE), that addresses the specific statistical question directly, without requiring a physical model for mixture components. Our approach is based on latent variable analysis and is computationally transparent, requires no additional experimental data, yet outperforms existing methods that use independent proportion measurements. CellCODE has few parameters that are robust and easy to interpret. The method can be used to track changes in proportion, improve power to detect differential expression and assign the differentially expressed genes to the correct cell-type.

                                                                                                                                                    TP068 (PT) - deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding
                                                                                                                                                    Date: Monday, July 11 3:30 pm - 3:50 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES / DATA
                                                                                                                                                    • Bo Liu, Center for Bioinformatics, Harbin Institute of Technology, China
                                                                                                                                                    • Dixian Zhu, Center for Bioinformatics, Harbin Institute of Technology, China
                                                                                                                                                    • Yadong Wang, Center for Bioinformatics, Harbin Institute of Technology, China

                                                                                                                                                    Area Session Chair: Pedja Radivojac

                                                                                                                                                    Presentation Overview: Show

                                                                                                                                                    Motivation: With the development of high-throughput sequencing, the number of assembled genomes continues to rise. It is critical to organize and index these assembled genomes well to promote future genomics studies. The Burrows-Wheeler Transform (BWT) is an important data structure for genome indexing with many fundamental applications; however, it is still non-trivial to construct the BWT for a large collection of genomes, especially for highly similar or repetitive genomes. Moreover, the state-of-the-art approaches do not support scalable parallel computing well because of their incremental nature, which is a bottleneck in using modern computers to accelerate BWT construction.
                                                                                                                                                    Results: We propose de Bruijn branch-based BWT constructor (deBWT), a novel parallel BWT construction approach. deBWT represents and organizes the suffixes of the input sequence with a novel data structure, de Bruijn branch encoding. This data structure takes advantage of the de Bruijn graph to facilitate the comparison between suffixes with long common prefixes, which breaks the bottleneck of BWT construction for repetitive genomic sequences. Meanwhile, deBWT also utilizes the structure of the de Bruijn graph to reduce unnecessary comparisons between suffixes. The benchmarking suggests that deBWT is efficient and scalable for constructing the BWT of large datasets by parallel computing. It is well-suited to indexing many genomes, such as a collection of individual human genomes, with multiple-core servers or clusters.
                                                                                                                                                    Availability: deBWT is implemented in C; the source code is available at https://github.com/hitbc/deBWT or https://github.com/DixianZhu/deBWT
                                                                                                                                                    Contact: ydwang@hit.edu.cn
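For readers unfamiliar with the transform itself, the sketch below shows the textbook definition of the BWT by sorting all suffixes of a sentinel-terminated string, together with its inversion. It is a naive single-threaded illustration only and is not the deBWT algorithm: it uses no de Bruijn branch encoding and no parallelism, and the function names and the `$` sentinel are assumptions chosen for this example. The cost of comparing suffixes that share long common prefixes, visible here in the sort key, is the bottleneck the abstract describes for repetitive genomes.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the Burrows-Wheeler Transform of text (naive suffix sorting)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    # Sort suffix start positions; comparing full suffixes is slow when
    # suffixes share long common prefixes, as in repetitive genomes.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character for each suffix is the character preceding it
    # (index -1 wraps around to the sentinel for the suffix at position 0).
    return "".join(s[i - 1] for i in order)


def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Recover the original text from its BWT."""
    n = len(last)
    # A stable sort of positions by character links each row of the first
    # column to the matching occurrence in the last column.
    order = sorted(range(n), key=lambda i: last[i])
    row = last.index(sentinel)  # row holding the original string
    out = []
    for _ in range(n - 1):
        row = order[row]
        out.append(last[row])
    return "".join(out)


transformed = bwt("banana")            # "annb$aa"
restored = inverse_bwt(transformed)    # "banana"
```

The sentinel is assumed to sort before every other character in the input, which holds for `$` with uppercase or lowercase nucleotide letters.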

                                                                                                                                                    TP069 (PT) - Finding correct protein-protein docking models using ProQDock
                                                                                                                                                    Date: Monday, July 11 3:30 pm - 3:50 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: PROTEINS
                                                                                                                                                    • Sankar Basu, Linköping University, Sweden
                                                                                                                                                    • Bjorn Wallner, Linköping University, Sweden

                                                                                                                                                    Area Session Chair: Reinhard Schneider

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: Protein-protein interactions are key to virtually all biological processes. For a detailed understanding of these processes, the structure of the protein complex is essential. Given the current experimental techniques for structure determination, the vast majority of all protein complexes will never be solved experimentally. In the absence of experimental data, computational docking methods can be used to predict the structure of the protein complex. A common strategy is to generate many alternative docking solutions (atomic models) and then use a scoring function to select the best. The success of computational docking depends, to a large degree, on the ability of the scoring function to accurately rank and score the many alternative docking models.
                                                                                                                                                    Results: Here, we present ProQDock, a scoring function that predicts the absolute quality of a docking model as measured by a novel protein docking quality score (DockQ). ProQDock uses support vector machines trained to predict the quality of protein docking models using features that can be calculated from the docking model itself. By combining different types of features describing both the protein-protein interface and the overall physical chemistry, it was possible to improve the correlation with DockQ from 0.25 for the best individual feature (EC) to 0.49 for the final version of ProQDock. ProQDock performed better than the state-of-the-art methods ZRANK and ZRANK2 in terms of correlations, ranking and finding correct models on an independent test set. Finally, we also demonstrate that it is possible to combine ProQDock with ZRANK and ZRANK2 to improve performance even further.

                                                                                                                                                    TP070 (PT) - Gene essentiality and synthetic lethality in haploid human cells
                                                                                                                                                    Date: Monday, July 11 3:30 pm - 3:50 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: GENES / SYSTEMS
                                                                                                                                                    • Jacques Colinge, IRCM Inserm U1194, University of Montpellier, ICM, France
                                                                                                                                                    • Vincent Blomen, NKI, Netherlands
                                                                                                                                                    • Peter Májek, CeMM, Austria
                                                                                                                                                    • Lucas Jae, NKI, Netherlands
                                                                                                                                                    • Johannes Bigenzahn, CeMM, Austria
                                                                                                                                                    • Joppe Nieuwenhuis, NKI, Netherlands
                                                                                                                                                    • Jacqueline Staring, NKI, Netherlands
                                                                                                                                                    • Roberto Sacco, CeMM, Austria
                                                                                                                                                    • Ferdy van Diemen, NKI, Netherlands
                                                                                                                                                    • Nadine Olk, CeMM, Austria
                                                                                                                                                    • Alexey Stukalov, CeMM, Austria
                                                                                                                                                    • Caleb Marceau, Stanford University School of Medicine, United States
                                                                                                                                                    • Hans Janssen, NKI, Netherlands
                                                                                                                                                    • Jan Carette, Stanford University School of Medicine, United States
                                                                                                                                                    • Keiryn Bennett, CeMM, Austria
                                                                                                                                                    • Giulio Superti-Furga, CeMM, Austria
                                                                                                                                                    • Thijn Brummelkamp, NKI, Netherlands

                                                                                                                                                    Area Session Chair: Judith Blake

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Among the many things one might want to know about a human cell, the list of its indispensable components, i.e. genes, is of great interest. Due to technical barriers, transposing pioneering work done in yeast has taken years. We present a first genome-wide mutational screen conducted in human haploid cells, which uncovered ~2000 genes required for fitness under culture conditions. Bioinformatic analyses were performed to extract global characteristics of human essential genes and the interactions they have with other genes. By performing similar screens on cells depleted of specific genes, we obtained a synthetic lethality network around the secretory pathway, thus providing a first genetic interaction network in human cells obtained by mutagenesis.

                                                                                                                                                    Finally, we will comment on differences and similarities with concomitant essential gene lists published by two other groups (Wang et al., Science, 2015; Hart et al., Cell, 2015).

                                                                                                                                                    TP071 (PT) - Solving the influence maximization problem on biological networks; a case study involving the cell cycle regulatory network in Saccharomyces Cerevisiae
                                                                                                                                                    Date: Monday, July 11 3:50 pm - 4:10 pm
                                                                                                                                                    Room: Northern Hemisphere BCD
                                                                                                                                                    Theme: DATA / SYSTEMS
                                                                                                                                                    • David Gibbs, Institute for Systems Biology, United States
                                                                                                                                                    • Ilya Shmulevich, Institute for Systems Biology, United States

                                                                                                                                                    Area Session Chair: Russell Schwartz

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    The Influence Maximization Problem (IMP) aims to discover the set of nodes with the greatest influence on network dynamics. The problem has previously been applied in epidemiology and social network analysis. Here, we demonstrate the application to cell cycle regulatory network analysis of Saccharomyces cerevisiae.
                                                                                                                                                    Fundamentally, gene regulation is linked to the flow of information. Therefore, our implementation of the IMP was framed as an information theoretic problem on a diffusion network. Utilizing all regulatory edges from YeastMine, gene expression dynamics were encoded as edge weights using a variant of time lagged transfer entropy, a method for quantifying information transfer across variables. Influence, for a particular number of sources, was measured using a diffusion model based on Markov chains with absorbing states. By maximizing over different numbers of sources, an influence ranking on genes was produced.
                                                                                                                                                    The influence ranking was compared to other metrics of network centrality. Although ‘top genes’ from each centrality ranking contained well known cell cycle regulators, there was little agreement and no clear winner. However, it was found that influential genes tend to directly regulate or sit upstream of genes ranked by other centrality measures. This is quantified by computing node reachability between gene sets; on average, 59% of central genes can be reached when starting from the influential set, compared to 7% of influential genes when starting at another centrality metric.
                                                                                                                                                    Influential nodes are critical sources of information flow; changes to them can alter the state of the network and may thereby lead to disease.
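
The node reachability comparison described above can be illustrated with a short sketch. It assumes a directed regulatory network stored as an adjacency dictionary and uses breadth-first search; the function names and the toy graph are hypothetical, and the abstract does not state how the authors computed reachability in practice.

```python
from collections import deque


def reachable_from(graph, sources):
    """Return every node reachable from any source along directed edges.

    The sources themselves are included in the result (an assumption of
    this sketch).
    """
    seen = set(sources)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for target in graph.get(node, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def reach_fraction(graph, sources, targets):
    """Fraction of the target set reachable when starting from the sources."""
    targets = set(targets)
    if not targets:
        return 0.0
    return len(reachable_from(graph, sources) & targets) / len(targets)


if __name__ == "__main__":
    # Toy regulator -> regulated-gene edges; not real YeastMine data.
    toy = {"A": ["B", "C"], "B": ["D"], "C": [], "D": [], "E": ["A"]}
    print(reach_fraction(toy, {"A"}, {"B", "C", "D", "E"}))
    print(reach_fraction(toy, {"C", "D"}, {"A"}))
```

An asymmetry between the two directions, as in the toy output, is the kind of signal summarized by the 59% versus 7% figures in the abstract.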

                                                                                                                                                    TP072 (PT) - Compacting de Bruijn graphs from sequencing data quickly and in low memory
                                                                                                                                                    Date: Monday, July 11 3:50 pm - 4:10 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES / DATA
                                                                                                                                                    • Rayan Chikhi, CNRS, France
                                                                                                                                                    • Antoine Limasset, IRISA, France
                                                                                                                                                    • Paul Medvedev, Pennsylvania State University, United States

                                                                                                                                                    Area Session Chair: Pedja Radivojac

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. The de Bruijn graph is a widely used data structure in fragment assembly algorithms, where it represents the information from a set of reads. Compaction is an important data reduction step in most de Bruijn graph-based algorithms, in which long simple paths are compacted into single vertices. Compaction has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem.

                                                                                                                                                    We present an algorithm and a tool, BCALM 2, for the compaction of de Bruijn graphs. BCALM 2 is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for a good balance of memory usage throughout its execution. For human sequencing data, BCALM 2 reduces the computational burden of compacting the de Bruijn graph to roughly an hour and 3 GB of memory. We also applied BCALM 2 to the 22 Gbp loblolly pine and 20 Gbp white spruce sequencing datasets. Compacted graphs were constructed from raw reads in less than 2 days and 40 GB of memory on a single machine. Hence, BCALM 2 is at least an order of magnitude more efficient than other available methods.
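
To make the compaction step concrete, here is a small single-threaded sketch that merges maximal non-branching paths of a node-centric de Bruijn graph into single strings (unitigs). It is not the BCALM 2 algorithm: it ignores reverse complements, keeps every k-mer in memory, and does no minimizer-based partitioning. All function names are hypothetical.

```python
def kmer_set(reads, k):
    """Collect the distinct k-mers of all reads (forward strand only)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def successors(node, nodes):
    """k-mers in the graph that overlap node's last k-1 characters."""
    return [node[1:] + c for c in "ACGT" if node[1:] + c in nodes]


def predecessors(node, nodes):
    """k-mers in the graph whose last k-1 characters start node."""
    return [c + node[:-1] for c in "ACGT" if c + node[:-1] in nodes]


def extend(start, nodes, visited):
    """Walk forward from start while the path stays non-branching."""
    path, current = start, start
    visited.add(start)
    while True:
        succ = successors(current, nodes)
        if len(succ) != 1:
            break
        nxt = succ[0]
        if nxt in visited or len(predecessors(nxt, nodes)) != 1:
            break
        path += nxt[-1]
        visited.add(nxt)
        current = nxt
    return path


def compact(nodes):
    """Return the sorted unitigs of the de Bruijn graph over nodes."""
    visited, unitigs = set(), []

    def starts_unitig(node):
        preds = predecessors(node, nodes)
        # A node begins a unitig unless it has exactly one predecessor
        # and that predecessor has exactly one successor.
        return len(preds) != 1 or len(successors(preds[0], nodes)) != 1

    for node in sorted(nodes):
        if node not in visited and starts_unitig(node):
            unitigs.append(extend(node, nodes, visited))
    # Whatever remains lies on isolated simple cycles; break each one
    # at its lexicographically smallest k-mer.
    for node in sorted(nodes):
        if node not in visited:
            unitigs.append(extend(node, nodes, visited))
    return sorted(unitigs)


if __name__ == "__main__":
    print(compact(kmer_set(["ATGGCGT", "GGCGA"], 3)))
```

In the example, the four k-mers ATG, TGG, GGC and GCG collapse into one vertex because the path between them never branches; the branch after GCG ends the unitig.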

                                                                                                                                                    TP073 (PT) - Human protein complex map: integration of 10K mass spectrometry experiments
                                                                                                                                                    Date: Monday, July 11 3:50 pm - 4:10 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: PROTEINS
                                                                                                                                                    • Kevin Drew, University of Texas at Austin, United States
                                                                                                                                                    • Edward Marcotte, University of Texas at Austin, United States

                                                                                                                                                    Area Session Chair: Reinhard Schneider

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Protein complexes carry out essential functions in the cell, but we currently lack knowledge of their composition, formation and function. Several recent studies using high-throughput discovery of protein interactions have allowed the construction of protein complex maps, but the protein overlap between these maps is limited. Here we take an integrated approach, combining protein interaction experiments from multiple published mass spectrometry datasets to construct a more complete human protein complex map. We evaluate both pairwise interactions and complexes using a novel clique-based comparison method and show improved performance over the published complex maps. Additionally, we find several new complexes, including ones enriched for developmental disorders, suggesting candidate disease genes. The expansiveness and accuracy of this complex map yield greater understanding of cellular function and provide avenues for better disease characterization.

                                                                                                                                                    TP074 (PT) - Influence maximization in time bounded network identifies transcription factors regulating perturbed pathways
                                                                                                                                                    Date: Monday, July 11 3:50 pm - 4:10 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Kyuri Jo, Seoul National University, Korea, Republic of
                                                                                                                                                    • Inuk Jung, Seoul National University, Korea, Republic of
                                                                                                                                                    • Ji Hwan Moon, Seoul National University, Korea, Republic of
                                                                                                                                                    • Sun Kim, Seoul National University, Korea, Republic of

                                                                                                                                                    Area Session Chair: Judith Blake

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    To understand the dynamic nature of biological processes, it is crucial to identify perturbed pathways in an altered environment and also to infer the regulators that trigger the response. Current time-series analysis methods, however, are not powerful enough to identify perturbed pathways and regulators simultaneously. Widely used methods determine gene sets, such as differentially expressed genes or gene clusters, and these gene sets need to be further interpreted in terms of biological pathways using other tools. Most pathway analysis methods are not designed for time-series data and do not consider gene-gene influence along the time dimension. In this paper, we propose TimeTP, a novel time-series analysis method for determining transcription factors regulating pathway perturbation. TimeTP narrows the focus to perturbed sub-pathways and uses the gene regulatory network and protein-protein interaction network to locate the transcription factors triggering the perturbation. TimeTP first identifies perturbed sub-pathways that propagate expression changes over time. Starting points of the perturbed sub-pathways are mapped into the network, and the most influential transcription factors are determined by an influence maximization technique. The analysis result is visually summarized in a TF-Pathway map in time clock. TimeTP was applied to a PIK3CA knock-in dataset and found significant sub-pathways and their regulators relevant to the PIP3 signaling pathway.

                                                                                                                                                    TP075 (PT) - Scalable Tools for Quantitative Analysis of Chemical-Genetic Interactions from Sequencing-Based Chemical-Genetic Interaction Screens
                                                                                                                                                    Date: Monday, July 11 4:10 pm - 4:30 pm
                                                                                                                                                    Room: Northern Hemisphere BCD
                                                                                                                                                    Theme: DATA / SYSTEMS
                                                                                                                                                    • Scott Simpkins, University of Minnesota, United States
                                                                                                                                                    • Justin Nelson, University of Minnesota, United States
                                                                                                                                                    • Raamesh Desphande, University of Minnesota, United States
                                                                                                                                                    • Jeffrey Piotrowski, Yumanity Therapeutics, United States
                                                                                                                                                    • Sheena Li, RIKEN Institute for Sustainable Resource Science, Japan
                                                                                                                                                    • Charles Boone, University of Toronto, Canada
                                                                                                                                                    • Chad Myers, University of Minnesota, United States

                                                                                                                                                    Area Session Chair: Russell Schwartz

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Recent improvements in the throughput of chemical-genetic interaction screens have necessitated the development of new, scalable pipelines for processing raw sequencing data from these experiments and interpreting the resulting chemical-genetic interaction profiles. We developed two computational tools, BEAN-counter and CG-TARGET, to respectively process and interpret the large influx of data from high-throughput chemical-genomic screens. These pipelines were applied to chemical-genetic interaction screens of more than 18,000 compounds in S. cerevisiae, ultimately yielding more than 2,000 compounds with high confidence predictions to biological process targets. We confirmed that our process-level target predictions overlap with the known functions of compounds and, importantly, enable us to discover novel compound modes-of-action. Additionally, these tools provided the foundation for new investigations into the nature of chemical interactions with biological systems.

                                                                                                                                                    TP076 (PT) - Succinct Colored de Bruijn Graphs
                                                                                                                                                    Date: Monday, July 11 4:10 pm - 4:30 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES / DATA
                                                                                                                                                    • Martin Muggli, Colorado State University, United States
                                                                                                                                                    • Alex Bowe, National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
                                                                                                                                                    • Travis Gagie, Department of Computer Science, University of Helsinki, Finland
                                                                                                                                                    • Robert Raymond, Colorado State University, United States
                                                                                                                                                    • Noelle R. Noyes, Colorado State University, United States
                                                                                                                                                    • Paul Morley, Colorado State University, United States
                                                                                                                                                    • Keith Belk, Colorado State University, United States
                                                                                                                                                    • Simon Puglisi, University of Helsinki, Finland
                                                                                                                                                    • Christina Boucher, Colorado State University, United States

                                                                                                                                                    Area Session Chair: Pedja Radivojac

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    MOTIVATION: Iqbal et al. (Nature Genetics, 2012) introduced the colored de Bruijn graph, a variant of the classic de Bruijn graph, which is aimed at "detecting and genotyping simple and complex genetic variants in an individual or population".
                                                                                                                                                    Because they are intended to be applied to massive population level data, it is essential that the graphs be represented efficiently.
                                                                                                                                                    Unfortunately, current succinct de Bruijn graph representations are not directly applicable to the colored de Bruijn graph, which requires additional information to be succinctly encoded as well as support for non-standard traversal operations.
                                                                                                                                                    RESULTS: Our data structure dramatically reduces the amount of memory required to store and use the colored de Bruijn graph, with some penalty to runtime, allowing it to be applied in much larger and more ambitious sequence projects than was previously possible. In particular, we use our method along with a custom curated database of antimicrobial resistant genes to track changes in the resistome across food production facilities. A short video of our work is available at http://cdbg.martindmuggli.com.
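
As a point of reference for what the colored de Bruijn graph stores, the sketch below keeps, for each k-mer, a bitmask of the samples (colors) that contain it. This is a plain dictionary, not the succinct representation the abstract describes, and it omits reverse complements and edge information; the function names are hypothetical.

```python
from collections import defaultdict


def colored_kmers(samples, k):
    """Map each k-mer to a bitmask; bit i is set if sample i contains it."""
    colors = defaultdict(int)
    for index, sequence in enumerate(samples):
        for i in range(len(sequence) - k + 1):
            colors[sequence[i:i + k]] |= 1 << index
    return dict(colors)


def samples_with(colors, kmer):
    """List the sample indices whose color bit is set for kmer."""
    mask = colors.get(kmer, 0)
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


if __name__ == "__main__":
    table = colored_kmers(["ACGT", "CGTA"], 3)
    print(table)
    print(samples_with(table, "CGT"))
```

Storing one color mask per k-mer grows with both the number of k-mers and the number of samples, which is the memory cost that a succinct encoding aims to reduce.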

                                                                                                                                                    TP077 (PT) - An Integer Programming Framework for Inferring Disease Complexes from Network Data
                                                                                                                                                    Date: Monday, July 11 4:10 pm - 4:30 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: PROTEINS / DISEASE
                                                                                                                                                    • Arnon Mazza, Tel Aviv University, Israel
                                                                                                                                                    • Konrad Klockmeier, Max Delbrück Center for Molecular Medicine, Germany
                                                                                                                                                    • Erich Wanker, Max Delbrück Center for Molecular Medicine, Germany
                                                                                                                                                    • Roded Sharan, School of Computer Science, Tel Aviv University, Israel

                                                                                                                                                    Area Session Chair: Reinhard Schneider

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Unraveling the molecular mechanisms that underlie disease calls for methods that go beyond the identification of single causal genes to inferring larger protein assemblies that take part in the disease process. Here we develop an exact, integer-programming-based method for associating protein complexes with disease. Our approach scores proteins based on their proximity in a protein-protein interaction network to a prior set that is known to be relevant for the studied disease. These scores are combined with interaction information to infer densely interacting protein complexes that are potentially disease-associated. We show that our method outperforms previous ones and leads to predictions that are well supported by current experimental data and literature knowledge.
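
The scoring step described above can be illustrated with a small sketch. The abstract does not say which proximity measure is used, so this example assumes random walk with restart, one common choice for network proximity. The function names, the toy network, and the `restart` value are hypothetical, and the integer-programming step that infers complexes is not shown.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Return an undirected adjacency map {node: set(neighbors)}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def proximity_scores(adj, prior, restart=0.5, iterations=100):
    """Random walk with restart from the prior set (assumed measure).

    At each step the walker either jumps back to a uniformly chosen prior
    protein (probability `restart`) or moves to a uniformly chosen neighbor.
    The returned values approximate the stationary visiting probabilities.
    """
    nodes = sorted(adj)
    start = {n: (1.0 / len(prior) if n in prior else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(iterations):
        nxt = {n: restart * start[n] for n in nodes}
        for u in nodes:
            # Spread the non-restarting mass of u evenly over its neighbors.
            share = (1.0 - restart) * score[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        score = nxt
    return score


# Toy path network A-B-C-D-E with protein A as the (hypothetical) prior set.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
toy_scores = proximity_scores(build_adjacency(toy_edges), prior={"A"})
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

In this toy network the scores fall off with distance from the prior protein, which is the property a proximity score needs before it is combined with interaction density.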

                                                                                                                                                    TP078 (PT) - Mogrify: a predictive system for cell reprogramming
                                                                                                                                                        Cancelled
                                                                                                                                                    Date: Monday, July 11 4:10 pm - 4:30 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: GENES / SYSTEMS
                                                                                                                                                    • Owen Rackham, Duke-NUS, Singapore
                                                                                                                                                    • Jaber Firas, Monash University, Australia
                                                                                                                                                    • Jose Polo, Monash University, Australia
                                                                                                                                                    • Julian Gough, University of Bristol, United Kingdom

                                                                                                                                                    Area Session Chair: Judith Blake

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Transdifferentiation, the process of converting from one cell type to another without going through a pluripotent state, has great promise for regenerative medicine. The identification of key transcription factors for reprogramming is currently limited by the cost of exhaustive experimental testing of plausible sets of factors, an approach that is inefficient and unscalable. Here we present a predictive system (Mogrify http://mogrify.net) that combines gene expression data with regulatory network information to predict the reprogramming factors necessary to induce cell conversion. We have applied Mogrify to over 300 human cell types and tissues, defining an atlas of cellular reprogramming. Mogrify correctly predicts the transcription factors used in known transdifferentiations. Furthermore, we validated two new transdifferentiations predicted by Mogrify. We provide a practical and efficient mechanism for systematically implementing novel cell conversions, facilitating the generalization of reprogramming of human cells. Predictions are made available to help rapidly further the field of cell conversion.

                                                                                                                                                    TP079 (PT) - Compressive Mapping for Next-Generation Sequencing
                                                                                                                                                    Date: Tuesday, July 12 10:10 am - 10:30 am
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Deniz Yorukoglu, Massachusetts Institute of Technology, United States
                                                                                                                                                    • Yun William Yu, Massachusetts Institute of Technology, United States
                                                                                                                                                    • Jian Peng, University of Illinois at Urbana-Champaign, United States
                                                                                                                                                    • Bonnie Berger, Massachusetts Institute of Technology, United States

                                                                                                                                                    Area Session Chair: Scott Markel

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    The high cost of mapping next-generation sequencing (NGS) read data onto a reference is a major bottleneck to sequencing analysis pipelines. We introduce COmpressive Read-mapping Accelerator (CORA), a framework that first maps reads to reads and reference to reference, exploiting inherent redundancies in both read and reference sequences, to accelerate read to reference mapping. We use this framework to map paired-end reads from the 1000 Genomes Project to the human reference, eliminating redundant sequence comparisons and improving time and sensitivity by orders of magnitude, particularly for multi-reads. The relative speed advantage of our approach will increase with the explosion of NGS data and advances in sequencing technologies, allowing researchers to keep pace with this data onslaught.

                                                                                                                                                    TP080 (PT) - Interactome based drug discovery and disease-disease connections
                                                                                                                                                    Date: Tuesday, July 12 10:10 am - 10:30 am
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: PROTEINS / DISEASE
                                                                                                                                                    • Gaurav Chopra, Purdue University, United States
                                                                                                                                                    • Ram Samudrala, SUNY Buffalo, United States

                                                                                                                                                    Area Session Chair: Natasa Przulj

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    We have developed a Computational Analysis of Novel Drug Opportunities (CANDO) platform (http://protinfo.org/cando/) funded by a 2010 NIH Director's Pioneer Award that analyzes compound-proteome interaction signatures to determine drug behavior, in contrast to traditional single (or few) target approaches. Our platform implements a modeling pipeline that generates an interaction matrix between 3,733 human approved drugs and 48,278 proteins using a hierarchical chem- and bio-informatic fragment-based docking with dynamics protocol (~1 billion predicted interactions evaluated, considering multiple binding sites per protein). The platform then uses similarity of interaction signatures across all proteins as indicative of similar functional behavior, and nonsimilar signatures for off- and anti-target (side) effects, in effect inferring homology of compound/drug behavior at a proteomic level. The benchmarking accuracy using this approach to rank compounds for over 650 indications/diseases is ~36%, in contrast to accuracies of ~0.2% obtained when using scrambled control matrices. We prospectively validated “high value” predictions in in vitro and in vivo preclinical studies for more than a dozen indications, including type 1 diabetes, herpes, dental caries, dengue, tuberculosis, malaria, hepatitis B, and different cancers. Our drug prediction accuracy is ~35% across the nine indications, where 57/162 compounds validated thus far show comparable or better activity than an existing drug, or micromolar inhibition at the cellular level, and serve as novel repurposeable therapies.
                                                                                                                                                    Taken together with benchmarking accuracy and the effect of druggable protein classes on repurposing accuracy, our multitargeting results indicate that a large number of protein structures with diverse fold space and a specific polypharmacological interactome are necessary for accurate drug predictions using our proteomic and evolutionary drug discovery and repurposing platform. Our approach is broadly applicable beyond repurposing, enables personalized and precision medicine, and foreshadows a new era of faster drug and target discovery using novel disease-disease connections.

                                                                                                                                                    TP081 (PT) - Classifying Cancer Samples by microRNA Profiles: Read the Fine Print!
                                                                                                                                                    Date: Tuesday, July 12 10:10 am - 10:30 am
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DISEASE / GENES
                                                                                                                                                    • Roni Rasnic, The Hebrew University of Jerusalem, Israel
                                                                                                                                                    • Michal Linial, The Hebrew University of Jerusalem, Israel
                                                                                                                                                    • Nathan Linial, The Hebrew University of Jerusalem, Israel

                                                                                                                                                    Area Session Chair: Yves Moreau

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    MicroRNAs (miRNAs) function primarily in regulating genes and maintaining cell homeostasis. Indeed, carcinogenesis is often represented by drastic perturbations in miRNA profiles. Many cancerous tissues share similar miRNA profiles with only a few dominating miRNAs. The Cancer Genome Atlas (TCGA) provides a rich resource with thousands of human samples covering >25 major cancer types. Here, we test the significance of miRNA information from TCGA in characterizing the cancer tissues and distinguishing their types and tissue origin. We apply an SVM multiclass classifier for assessing the separation power between cancer types and some of their healthy tissues. The ML approach was applied to 8522 samples associated with expression data for 1047 miRNAs. We find that the set of the lowest expressed miRNAs, which comprises only 0.003% of total miRNA reads, has a higher separation power. In fact, including the complementary set of the highly expressed miRNAs deteriorates the classification success. We are able to improve the identification following a simple discretization of the data, improving the success from 56% by the naïve usage of the miRNA profiles to ~90%. We suggest using the separation capacity of the lowly expressed miRNAs for characterization of metastatic tumors with unknown tissue origin. Furthermore, we gain surprising and useful insights on classes that suffer a consistent failure in identification.
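
To make the idea of discretizing expression data concrete, here is a minimal sketch. The abstract does not specify the discretization scheme, so the thresholds, the three-level coding, and the sample values below are assumptions chosen only for illustration; the SVM classification step is not shown.

```python
def discretize(counts, thresholds=(1, 10)):
    """Map raw read counts to ordinal levels (hypothetical scheme).

    With the default thresholds: 0 = not detected (count < 1),
    1 = low (1 <= count < 10), 2 = higher (count >= 10).
    """
    levels = []
    for c in counts:
        # The level is the number of thresholds the count reaches.
        levels.append(sum(1 for t in thresholds if c >= t))
    return levels


# Two made-up samples with very different read depth for the dominant miRNAs.
sample_a = [0, 3, 250, 12000]
sample_b = [0, 7, 900, 48000]
profile_a = discretize(sample_a)
profile_b = discretize(sample_b)
```

After discretization the two made-up samples have the same profile, because the magnitude of the highly expressed miRNAs no longer dominates the representation.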

                                                                                                                                                    TP082 (PT) - RapMap: A Rapid, Sensitive and Accurate Tool for Mapping RNA-seq Reads to Transcriptomes
                                                                                                                                                    Date: Tuesday, July 12 10:30 am - 10:50 am
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Avi Srivastava, Stony Brook University, United States
                                                                                                                                                    • Hirak Sarkar, Stony Brook University, United States
                                                                                                                                                    • Nitish Gupta, Stony Brook University, United States
                                                                                                                                                    • Rob Patro, Stony Brook University, United States

                                                                                                                                                    Area Session Chair: Scott Markel

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: The alignment of sequencing reads to a transcriptome is a common and important step in many RNA-seq analysis tasks. When aligning RNA-seq reads directly to a transcriptome (as is common in the de novo setting or when a trusted reference annotation is available), care must be taken to report the potentially large number of multi-mapping locations per read. This can pose a substantial computational burden for existing aligners, and can considerably slow downstream analysis.

                                                                                                                                                    Results: We introduce a novel concept, quasi-mapping, and an efficient algorithm implementing this approach for mapping sequencing reads to a transcriptome. By attempting only to report the potential loci of origin of a sequencing read, and not the base-to-base alignment by which it derives from the reference, RapMap - our tool implementing quasi-mapping - is capable of mapping sequencing reads to a target transcriptome substantially faster than existing alignment tools. The algorithm we employ to implement quasi-mapping uses several efficient data structures and takes advantage of the special structure of shared sequence prevalent in transcriptomes to rapidly provide highly-accurate mapping information. We demonstrate how quasi-mapping can be successfully applied to the problems of transcript-level quantification from RNA-seq reads and the clustering of contigs from de novo assembled transcriptomes into biologically-meaningful groups.

                                                                                                                                                    Availability: RapMap is implemented in C++11 and is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/RapMap.

                                                                                                                                                    Contact: rob.patro@cs.stonybrook.edu
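
The difference between reporting loci of origin and computing a base-to-base alignment can be sketched with a toy k-mer lookup. This is not RapMap's algorithm or data structure; it is a simplified stand-in that assumes a plain hash index of k-mers, and all function names and sequences below are hypothetical.

```python
from collections import Counter, defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidate_transcripts(read, index, k):
    """Return the transcripts sharing the most k-mers with the read.

    No base-to-base alignment is computed; only k-mer membership is checked,
    and all transcripts tied for the highest k-mer count are reported.
    """
    hits = Counter()
    for i in range(len(read) - k + 1):
        for name in index.get(read[i:i + k], ()):
            hits[name] += 1
    if not hits:
        return []
    best = max(hits.values())
    return sorted(name for name, count in hits.items() if count == best)


# Two made-up isoforms share a prefix; a third transcript is unrelated.
shared = "ACGTTGCAAGCT"
toy_transcripts = {
    "t1": shared + "GGATCC",
    "t2": shared + "TTAACG",
    "t3": "CCCCGGGGAAAATTTT",
}
toy_index = build_kmer_index(toy_transcripts, k=5)
multi_mapping = candidate_transcripts("GTTGCAAG", toy_index, k=5)
unique_mapping = candidate_transcripts("AGCTGGATC", toy_index, k=5)
```

A read drawn from the shared region is reported for both isoforms, which mirrors the multi-mapping behavior the abstract says must be handled when mapping directly to a transcriptome.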

                                                                                                                                                    TP083 (PT) - A convex optimization approach for identification of human tissue-specific interactomes
                                                                                                                                                    Date: Tuesday, July 12 10:30 am - 10:50 am
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: SYSTEMS / DISEASE
                                                                                                                                                    • Shahin Mohammadi, Purdue University, United States
                                                                                                                                                    • Ananth Grama, Department of Computer Science, Purdue University, United States

                                                                                                                                                    Area Session Chair: Natasa Przulj

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: Analysis of organism-specific interactomes has yielded novel insights into cellular function and coordination, understanding of pathology, and identification of markers and drug targets. Genes, however, can exhibit varying levels of cell-type specificity in their expression, and their coordinated expression manifests in tissue-specific function and pathology. Tissue-specific/selective interaction mechanisms have significant applications in drug discovery, as they are more likely to reveal drug targets. Furthermore, tissue-specific transcription factors (tsTFs) are significantly implicated in human disease, including cancers. Finally, disease genes and protein complexes have the tendency to be differentially expressed in tissues in which defects cause pathology. These observations motivate the construction of refined tissue-specific interactomes from organism-specific interactomes.

                                                                                                                                                    Results: We present a novel technique for constructing human tissue-specific interactomes. Using a variety of validation tests (ESEA, GO Enrichment, Disease-Gene Subnetwork Compactness), we show that our proposed approach significantly outperforms state-of-the-art techniques. Finally, using case studies of Alzheimer's and Parkinson's diseases, we show that tissue-specific interactomes derived from our study can be used to construct pathways implicated in pathology and demonstrate the use of these pathways in identifying novel targets.

                                                                                                                                                    Availability: http://www.cs.purdue.edu/homes/mohammas/projects/ActPro.html

                                                                                                                                                    TP084 (PT) - RNA sequencing-based cell proliferation analysis across 19 cancers identifies a subset of proliferation-informative cancers with a common survival signature
                                                                                                                                                    Date: Tuesday, July 12 10:30 am - 10:50 am
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DISEASE
                                                                                                                                                    • Brittany Lasseigne, HudsonAlpha Institute for Biotechnology, United States
                                                                                                                                                    • Ryne Ramaker, HudsonAlpha Institute for Biotechnology and The University of Alabama at Birmingham, United States
                                                                                                                                                    • Laura Palacio, HudsonAlpha Institute for Biotechnology, United States
                                                                                                                                                    • David Gunther, HudsonAlpha Institute for Biotechnology, United States
                                                                                                                                                    • Sara Cooper, HudsonAlpha Institute for Biotechnology, United States
                                                                                                                                                    • Richard Myers, HudsonAlpha Institute for Biotechnology, United States

                                                                                                                                                    Area Session Chair: Yves Moreau

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Despite advances in cancer diagnosis and treatment strategies, it has been difficult to identify robust prognostic signatures in cancer. Cell proliferation has long been recognized as a potential prognostic marker in cancer, but has not been investigated across multiple cancers using tissue-based RNA sequencing. Here we explore the role of cell proliferation across 19 cancers (n=6,312 patients) from The Cancer Genome Atlas project by employing a ‘proliferative index’ derived from gene expression associated with PCNA expression. This proliferative index is significantly associated with patient survival (Cox, p-value<0.05) in 8/19 cancers, which we have defined as ‘proliferation-informative cancers’ (PICs). In PICs the proliferative index is strongly correlated with tumor stage and nodal invasion. Furthermore, PICs demonstrate lower proliferation machinery expression relative to other cancers (Spearman, p=1.76E-23). Transcriptome-wide predictive survival modeling using multivariate Cox regression with L1-penalized log partial likelihood (LASSO) for feature selection outperformed the ‘proliferative-index’ in 18/19 cancers. Survival-associated expression patterns were relatively unique between cancers; however, PICs have a common survival signature of 86 genes (Cox, p<0.05 across all 8 cancers). Additionally, we find that proliferative index is significantly associated with somatic mutation burden (Spearman, p=1.76E-23). This study presents cancers for which cell proliferation may be an important prognostic marker and demonstrates that modern machine learning techniques can identify survival models more predictive than, and independent of, proliferative index for most cancers.
                                                                                                                                                    We also present evidence for cell proliferation as a proxy for clinical parameters and confirm an association between cell proliferation and somatic mutation burden across cancers.

                                                                                                                                                    TP085 (PT) - ADAGE-Based Extraction of Biological Context from Public Gene Expression Data
                                                                                                                                                    Date: Tuesday, July 12 10:50 am - 11:10 am
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES / DATA
                                                                                                                                                    • Jie Tan, Geisel School of Medicine at Dartmouth, United States
                                                                                                                                                    • John Hammond, Geisel School of Medicine at Dartmouth, United States
                                                                                                                                                    • Deborah Hogan, Geisel School of Medicine at Dartmouth, United States
                                                                                                                                                    • Casey Greene, University of Pennsylvania, United States

                                                                                                                                                    Area Session Chair: Scott Markel

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    In this talk, I will introduce the overarching question that I’m addressing in my thesis: “How do we extract biological patterns from heterogeneous public gene expression data using unsupervised methods?” To address this challenge, we recently developed and published ADAGE (Analysis using Denoising Autoencoders for Gene Expression) in the journal mSystems. ADAGE is a method based on deep learning that extracts features representing biological states of an organism from the organism’s complete expression compendium without requiring pathway annotations or other curated knowledge. In this talk, I’ll primarily highlight the ADAGE method, and I’ll demonstrate how ADAGE can be applied to analyzing new RNA-Seq datasets. I’ll cover how ADAGE can be used to generate new hypotheses about how different environments activate distinct pathways. I’ll wrap up by mentioning an upcoming contribution: an approach that we call eADAGE that significantly improves the abundance and completeness of pathways extracted by ADAGE.

                                                                                                                                                    TP086 (PT) - Precision drug repurposing and multi-target drug design using structural systems pharmacology
                                                                                                                                                    Date: Tuesday, July 12 10:50 am - 11:10 am
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: PROTEINS / DISEASE
                                                                                                                                                    • Thomas Hart, Rockefeller University, United States
                                                                                                                                                    • Shihab Dider, Hunter College, CUNY, United States
                                                                                                                                                    • Weiwei Han, Jilin University, China
                                                                                                                                                    • Hua Xu, University of Texas Health Center, United States
                                                                                                                                                    • Zhongming Zhao, University of Texas Health Center, United States
• Philip Bourne, National Institutes of Health, United States
                                                                                                                                                    • Lei Xie, Hunter College, The City University of New York, United States

                                                                                                                                                    Area Session Chair: Natasa Przulj

Presentation Overview:

Precision medicine is an emerging method for disease treatment. However, its advance is hindered by a lack of mechanistic understanding of the energetics and dynamics of genome-wide drug-target and genetic interactions. To address this challenge, we have developed a novel structural systems pharmacology approach to elucidate the molecular basis and genetic biomarkers of drug action. We have applied our approach to repurposing metformin, an anti-diabetes drug, for precision anti-cancer therapy. By searching the human structural proteome, we identified putative metformin binding targets and experimentally verified the predictions. Subsequently, we linked these binding targets, through protein-protein interactions, to genes whose expression is altered by metformin, and identified network biomarkers of drug phenotypic response. The key nodes in the genetic networks are largely consistent with the existing experimental evidence. Their interactions can be affected by the observed cancer mutations. This study demonstrates that structural systems pharmacology is a powerful tool for precision medicine.

                                                                                                                                                    TP087 (PT) - Data-Driven Analysis of Lymphocyte Infiltration in Breast Cancer Development and Progression
                                                                                                                                                    Date: Tuesday, July 12 10:50 am - 11:10 am
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DISEASE
                                                                                                                                                    • Ruth Dannenfelser, Princeton University, United States
                                                                                                                                                    • Josie Ursini-Siegel, Lady Davis Institute for Medical Research, Canada
                                                                                                                                                    • Vessela Kristensen, Radiumhospitalet, Norway
                                                                                                                                                    • Olga Troyanskaya, Princeton University, United States

                                                                                                                                                    Area Session Chair: Yves Moreau

Presentation Overview:

The tumor microenvironment is now widely recognized for its role in tumor progression, treatment response, and clinical outcome. The intratumoral immunological landscape, in particular, has been shown to exert both pro-tumorigenic and anti-tumorigenic effects. Thus far, direct detailed studies of the cell composition of tumor infiltration have been limited, with some studies giving approximate quantifications using immunohistochemistry and other small studies obtaining detailed measurements by laboriously isolating cells from newly excised tumors and sorting them using flow cytometry. Herein we use a machine-learning-based approach to identify lymphocyte markers with which we can quantify the presence of B cells, cytotoxic T-lymphocytes, T-helper 1, and T-helper 2 cells in any gene expression data set, and we apply it to studies of breast tissue. By leveraging many samples from existing large-scale studies, we find an inherent cell heterogeneity in clinically characterized immune infiltrates, a strong link between estrogen receptor status and infiltration in normal and tumor tissues, and changes with genomic complexity, and we identify characteristic differences in lymphocyte expression among molecular groupings. Furthermore, we explore the effects that detailed infiltration patterns have on patient survival and changes with anti-estrogen therapy.

                                                                                                                                                    TP088 (PT) - SHARAKU: An algorithm for aligning and clustering read mapping profiles of deep sequencing in non-coding RNA processing
                                                                                                                                                    Date: Tuesday, July 12 11:40 am - 12:00 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Mariko Tsuchiya, Keio University, Japan
                                                                                                                                                    • Kojiro Amano, Keio University, Japan
                                                                                                                                                    • Masaya Abe, Keio University, Japan
                                                                                                                                                    • Misato Seki, Keio University, Japan
                                                                                                                                                    • Sumitaka Hase, Keio University, Japan
                                                                                                                                                    • Kengo Sato, Keio University, Japan
                                                                                                                                                    • Yasubumi Sakakibara, Keio University, Japan

                                                                                                                                                    Area Session Chair: Scott Markel

Presentation Overview:

Motivation: Deep sequencing of the transcripts of regulatory non-coding RNA generates footprints of post-transcriptional processes. After sequencing, the short reads are mapped to a reference genome, and specific mapping patterns, called read mapping profiles, can be detected; these are distinct from random non-functional degradation patterns. These patterns reflect the maturation processes that lead to the production of shorter RNA sequences. Recent next-generation sequencing studies have revealed not only the typical maturation process of miRNAs but also the various processing mechanisms of small RNAs derived from tRNAs and snoRNAs.
Results: We developed an algorithm termed SHARAKU to align two read mapping profiles of next-generation sequencing outputs for non-coding RNAs. In contrast with previous work, SHARAKU incorporates the primary and secondary sequence structures into an alignment of read mapping profiles to allow for the detection of common processing patterns. Using a benchmark simulated dataset, SHARAKU exhibited superior performance to previous methods for correctly clustering the read mapping profiles with respect to 5’-end processing and 3’-end processing from degradation patterns and in detecting similar processing patterns in deriving the shorter RNAs. Further, using experimental data of small RNA sequencing for the common marmoset brain, SHARAKU succeeded in identifying the significant clusters of read mapping profiles for similar processing patterns of small derived RNA families expressed in the brain.

TP089 (PT) - Nucleotide sequence composition adjacent to intronic 5’ end improves translation costs in fungi
                                                                                                                                                    Date: Tuesday, July 12 11:40 am - 12:00 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: SYSTEMS / GENES
                                                                                                                                                    • Zohar Zafrir, Tel Aviv University, Israel
• Tamir Tuller, Tel Aviv University, Department of Biomedical Engineering, Israel

                                                                                                                                                    Area Session Chair: Natasa Przulj

Presentation Overview:

It is generally believed that introns are not translated; therefore, the potential intronic sequence features that may be related to the translation step (occurring after splicing) have yet to be thoroughly studied. Focusing on four fungi as model organisms (S. cerevisiae, S. pombe, A. nidulans, and C. albicans), we performed a comprehensive large-scale systems biology study to characterize for the first time how translation is encoded in introns and affects their evolution. When considering the reading frame of exons upstream and adjacent to introns, we find evidence suggesting a preference for intronic STOP codons close to the intronic 5’ end, and that the beginning of introns is selected for codons with higher translation efficiency, presumably resulting in reduced translation and metabolic costs in cases of non-spliced introns. Ribosomal profiling data analysis in S. cerevisiae supports the conjecture that intron retention occurs frequently in this organism; thus, introns are partially translated, and their translation efficiency affects organismal fitness. We also show that this selection is stronger in highly translated and highly spliced genes, but is not associated only with genes with a specific function. Finally, we discuss the potential relation of the reported signals to an efficient nonsense-mediated decay (NMD) pathway due to splicing errors. These new discoveries, supported by population-genetics considerations, contribute to a broader understanding of intron evolution, and of how silent mutations affect gene expression and organismal fitness.

The talk is based on a paper accepted for publication in the journal DNA Research; I will also review very recent related studies (Zafrir & Tuller, RNA, 2015; Yofe* and Zafrir* et al., PLoS Genetics, 2014).

                                                                                                                                                    TP090 (PT) - Phenotype Stratification from the Electronic Health Record using Autoencoders
                                                                                                                                                    Date: Tuesday, July 12 11:40 am - 12:00 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DISEASE / DATA
                                                                                                                                                    • Brett K Beaulieu-Jones, University of Pennsylvania, United States
                                                                                                                                                    • Jason H Moore, University of Pennsylvania, United States
                                                                                                                                                    • Casey S Greene, University of Pennsylvania, United States

                                                                                                                                                    Area Session Chair: Yves Moreau

Presentation Overview:

Genetic association and, on a larger scale, personalized medicine require highly specific and accurate phenotypes. Research-quality phenotyping is costly and can require manual clinician review. Electronic Health Records (EHRs) contain a wealth of phenotypic information but were built for clinical and billing purposes. Effectively extracting this information for research is challenging because many records are sparsely filled and labeled with billing codes. Here, we show the unsupervised use of autoencoders to model patients in the EHR. To evaluate model fit, we created a semi-supervised classifier by adding a random forest to the trained autoencoder. Semi-supervised denoising autoencoders showed classification improvements in simulation models, particularly when small numbers of patients have high-quality phenotypes. Deep autoencoders with dropout effectively imputed missing data in the PRO-ACT ALS clinical trial dataset, as measured by spike-in imputation accuracy. Deep autoencoder imputed data enabled more accurate prediction of ALS disease progression as defined by the ALS Functional Rating Scale. Finally, we show that despite symptomatic heterogeneity, ALS disease progression appears homogeneous, with time from onset being the most important predictor.

                                                                                                                                                    TP091 (PT) - Analysis of differential splicing suggests different modes of short-term splicing regulation
                                                                                                                                                    Date: Tuesday, July 12 12:00 pm - 12:20 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Hande Topa, Aalto University, Finland
                                                                                                                                                    • Antti Honkela, University of Helsinki, Finland

                                                                                                                                                    Area Session Chair: Scott Markel

Presentation Overview:

Motivation: Alternative splicing is an important mechanism in which the regions of pre-mRNAs are differentially joined to form different transcript isoforms. Alternative splicing is involved in the regulation of normal physiological functions but is also linked to the development of diseases such as cancer. We analyse differential expression and splicing using RNA-seq time series in three different settings: overall gene expression levels, absolute transcript expression levels and relative transcript expression levels.
                                                                                                                                                    Results: Using estrogen receptor alpha signalling response as a model system, our Gaussian process (GP)-based test identifies genes with differential splicing and/or differentially expressed transcripts. We discover genes with consistent changes in alternative splicing independent of changes in absolute expression and genes where some transcripts change while others stay constant in absolute level. The results suggest classes of genes with different modes of alternative splicing regulation during the experiment.
                                                                                                                                                    Availability: R and Matlab codes implementing the method are available at https://github.com/PROBIC/diffsplicing. An interactive browser for viewing all model fits is available at http://users.ics.aalto.fi/hande/splicingGP/.

                                                                                                                                                    TP092 (PT) - Prediction of Ribosome Footprint Profile Shapes from Transcript Sequences
                                                                                                                                                    Date: Tuesday, July 12 12:00 pm - 12:20 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: SYSTEMS / GENES
                                                                                                                                                    • Tzu-Yu Liu, University of Pennsylvania, United States
                                                                                                                                                    • Yun S. Song, University of California, Berkeley, United States

                                                                                                                                                    Area Session Chair: Natasa Przulj

Presentation Overview:

                                                                                                                                                    Motivation: Ribosome profiling is a useful technique for studying translational dynamics and quantifying protein synthesis. Applications of this technique have shown that ribosomes are not uniformly distributed along mRNA transcripts. Understanding how each transcript-specific distribution arises is important for unraveling the translation mechanism.

                                                                                                                                                    Results: Here, we apply kernel smoothing to construct predictive features and build a sparse model to predict the shape of ribosome footprint profiles from transcript sequences alone. Our results on Saccharomyces cerevisiae data show that the marginal ribosome densities can be predicted with high accuracy. The proposed novel method has a wide range of applications, including inferring isoform-specific ribosome footprints, designing transcripts with fast translation speeds, and discovering unknown modulation during translation.
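As a rough illustration of what kernel smoothing does, the sketch below applies a generic Gaussian-kernel (Nadaraya-Watson) smoother to a toy per-position signal. The abstract does not say how the authors build their predictive features, so the function name, the bandwidth, and the toy values are all assumptions made for this example only.

```python
import math


def gaussian_kernel_smooth(values, bandwidth=1.0):
    """Nadaraya-Watson smoothing of a 1-D signal with a Gaussian kernel.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        # Weight every position j by its distance from position i.
        weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(n)]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed


# Toy signal with a single spike (made-up numbers).
toy_signal = [0.0, 0.0, 10.0, 0.0, 0.0]
smoothed_signal = gaussian_kernel_smooth(toy_signal, bandwidth=1.0)
```

A larger bandwidth spreads the spike over more neighbouring positions; a smaller one keeps the smoothed signal closer to the raw values.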

                                                                                                                                                    TP093 (PT) - Leveraging electronic medical records for systematic drug repositioning
                                                                                                                                                    Date: Tuesday, July 12 12:00 pm - 12:20 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DISEASE / DATA
                                                                                                                                                    • Hyojung Paik, UCSF, United States
                                                                                                                                                    • Ah-Young Chung, Korea University, Korea, Republic of
                                                                                                                                                    • Hae-Chul Park, Korea University, Korea, Republic of
                                                                                                                                                    • Rae Woong Park, Ajou University, Korea, Republic of
                                                                                                                                                    • Kyoungho Suk, Kyungpook National University, Korea, Republic of
                                                                                                                                                    • Atul Butte, UCSF, United States
                                                                                                                                                    • Jihyun Kim, Ajou University, Korea, Republic of
                                                                                                                                                    • Hyosil Kim, Ajou University, Korea, Republic of

                                                                                                                                                    Area Session Chair: Yves Moreau

Presentation Overview:

Prediction of new disease indications for approved drugs by computational methods has been based largely on the genomics signatures of drugs and diseases. We propose a method for drug repositioning that uses the clinical signatures extracted from electronic medical records (EMRs) of a tertiary hospital, including > 9.4 M laboratory tests from > 530,000 patients, in addition to diverse genomics signatures. Cross-validation shows that this approach outperforms various predictive models based on genomics signatures. The prediction suggests that terbutaline sulfate, which is widely used for asthma, is a promising candidate for amyotrophic lateral sclerosis, for which there are few therapeutic options. In in vivo tests, terbutaline sulfate prevents defects in neuromuscular degeneration and also has therapeutic potential. Cotreatment with a β2-adrenergic receptor antagonist, butoxamine, suggests that the effect of terbutaline is mediated by activation of β2-adrenergic receptors. Our approach suggests that EMRs are valuable resources for discovering novel indications of drugs.

                                                                                                                                                    TP094 (PT) - Fast and accurate computation of differential splicing across multiple conditions
                                                                                                                                                    Date: Tuesday, July 12 12:20 pm - 12:40 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Jc Entizne, Pompeu Fabra University, Spain
                                                                                                                                                    • A Pages, Pompeu Fabra University, Spain
                                                                                                                                                    • Jl Trincado, Pompeu Fabra University, Spain
                                                                                                                                                    • Gp Alamancos, Pompeu Fabra University, Spain
                                                                                                                                                    • M Skalic, Pompeu Fabra University, Spain
                                                                                                                                                    • N Bellora, Pompeu Fabra University, Spain
                                                                                                                                                    • Eduardo Eyras, Pompeu Fabra University, Spain

                                                                                                                                                    Area Session Chair: Scott Markel

Presentation Overview:

                                                                                                                                                    Abstract

Alternative splicing plays an essential role in many cellular processes in eukaryotes, and high-throughput RNA sequencing has allowed genome-wide studies of splicing across multiple conditions. However, the increasing number of data sets represents a major computational challenge, and there are no dedicated tools for the study of splicing changes across multiple conditions. We describe SUPPA (Alamancos et al. 2015), a computational tool to calculate relative inclusion values of alternative splicing events from transcript quantification. Using simulated and experimental datasets, SUPPA achieves accuracies similar to those of standard methodologies but is a thousand times faster. We extended SUPPA to calculate differential splicing across multiple conditions. Applied to data across different stages of cell differentiation, SUPPA uncovers new splicing regulatory networks governing specific cell fates. SUPPA facilitates the study of splicing regulation across multiple conditions with large numbers of samples and limited computational resources.
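To make "relative inclusion values from transcript quantification" concrete, here is a minimal sketch that assumes the common definition: the summed abundance of transcripts that include the event, divided by the summed abundance of all transcripts that define it. This is not SUPPA's code or interface; the function name, transcript identifiers, and TPM values are hypothetical.

```python
def relative_inclusion(tpm, inclusion_ids, exclusion_ids):
    """Relative inclusion of one splicing event from transcript abundances.

    tpm: dict mapping transcript id -> abundance (e.g. TPM).
    inclusion_ids: transcripts containing the event's inclusion form.
    exclusion_ids: transcripts containing the exclusion form.
    Returns None when no transcript defining the event is expressed.
    """
    included = sum(tpm.get(t, 0.0) for t in inclusion_ids)
    excluded = sum(tpm.get(t, 0.0) for t in exclusion_ids)
    total = included + excluded
    if total == 0:
        return None  # value is undefined without expression
    return included / total


# Made-up abundances for three hypothetical transcripts of one gene.
toy_tpm = {"tx1": 30.0, "tx2": 10.0, "tx3": 60.0}
psi = relative_inclusion(toy_tpm, ["tx1", "tx2"], ["tx3"])
```

Because the calculation only sums abundances that are already quantified, it needs no further read-level processing, which is consistent with the speed advantage the abstract reports.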

                                                                                                                                                    Impact

Alternative pre-mRNA splicing diversifies the repertoire of transcripts in multicellular organisms, thereby providing a complex layer of gene regulation. There is increasing evidence that alternative splicing plays a crucial role in development and disease, and it has been identified as a key regulatory mechanism capable of triggering undifferentiated cell states (Gabut et al. 2011, Han et al. 2013). High-throughput sequencing technologies allow the determination of splicing patterns across multiple conditions, but pose major computational challenges. SUPPA meets these challenges by allowing for fast computation of splicing patterns across multiple conditions. SUPPA’s accuracy has been extensively tested using RNA sequencing data for a 23-point time-course of Arabidopsis plants transferred from 20°C to 4°C, and by comparing with an RT-PCR platform using the same samples (Zhang et al. 2015). This has moreover facilitated the identification of new splicing changes in response to temperature. We have applied SUPPA to data across different stages of cell differentiation in human to uncover novel regulatory programs of pluripotency controlled by RNA binding proteins. In summary, SUPPA provides a powerful means to uncover new relevant gene regulatory mechanisms and allows the systematic analysis of splicing by small labs with limited computational resources (Sebestyen et al. 2016). Finally, SUPPA is developed in Python and is an open source project with multiple contributors (https://bitbucket.org/regulatorygenomicsupf/suppa).

                                                                                                                                                    Alamancos et al. (2015). RNA 21(9):1521-31.
                                                                                                                                                    Zhang et al. (2015). New Phytol 208(1):96-101.
                                                                                                                                                    Sebestyen et al. (2016). http://biorxiv.org/content/early/2015/08/02/023010
                                                                                                                                                    Gabut et al. (2011). Cell 147:132-146.
                                                                                                                                                    Han et al. (2013). Nature 498(7453):241-5.

                                                                                                                                                    TP095 (PT) - Rapid Translation Initiation Prevents Mitochondrial Localization of mRNA
                                                                                                                                                    Date: Tuesday, July 12 12:20 pm - 12:40 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: SYSTEMS / GENES
                                                                                                                                                    • Thomas Poulsen, National Institute of Advanced Industrial Science and Technology (AIST), Japan
                                                                                                                                                    • Kenichiro Imai, National Institute of Advanced Industrial Science and Technology (AIST), Japan
                                                                                                                                                    • Martin Frith, National Institute of Advanced Industrial Science and Technology (AIST), Japan
                                                                                                                                                    • Paul Horton, National Institute of Advanced Industrial Science and Technology (AIST), Japan

                                                                                                                                                    Area Session Chair: Natasa Przulj

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    The mRNAs of some, but not all, nuclear encoded mitochondrial proteins localize to the periphery of mitochondria. Previous studies have shown that both the nascent polypeptide chain and an mRNA binding protein play a role in this phenomenon, and have noted a positive correlation between mRNA length and mitochondrial localization. Here, we report the first investigation into the relationship between mRNA translation initiation rate and mRNA mitochondrial localization. Our results indicate that translation initiation promoting factors such as Kozak sequences are associated with cytosolic localization, while inhibiting factors such as 5' UTR secondary structure correlate with mitochondrial localization. Moreover, the frequencies of nucleotides in various positions of the 5' UTR show higher correlation with localization than those of the 3' UTR. These results suggest that rapid translation initiation may prevent mRNA mitochondrial localization. Interestingly, this may explain why short mRNAs, which are thought to initiate translation rapidly, seldom localize to mitochondria. We therefore propose a model in which translating mRNA has reduced mobility and tends not to reach the mitochondria. Finally, we explore this model with a simulation of mRNA diffusion using previously estimated translation initiation probabilities and confirm that our model produces localization values similar to those measured in experimental studies.
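
The proposed model, in which an mRNA loses mobility once translation begins, can be illustrated with a toy simulation. Everything below is an assumption made for illustration: the one-dimensional random walk, the fixed per-step initiation probability, the distance to the mitochondrion, the step limit, and all names. It is not the simulation used in the study.

```python
import random


def localizes(p_init, distance=3, max_steps=200, rng=random):
    """Simulate one mRNA; return True if it reaches the mitochondrion."""
    position = 0
    for _ in range(max_steps):
        if rng.random() < p_init:
            # Translation starts; the mRNA is assumed immobile from here on.
            return False
        position += rng.choice((-1, 1))  # one unbiased diffusion step
        if position == distance:
            return True
    return False


def localized_fraction(p_init, trials=2000, seed=1):
    """Fraction of simulated mRNAs that reach the mitochondrion."""
    rng = random.Random(seed)
    hits = sum(localizes(p_init, rng=rng) for _ in range(trials))
    return hits / trials


for p in (0.0, 0.01, 0.05, 0.2):
    print(p, localized_fraction(p))
```

In this toy setting a higher initiation probability leaves fewer steps for diffusion before the mRNA stops, so fewer molecules reach the target, which is the qualitative behaviour the model predicts.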

                                                                                                                                                    TP096 (PT) - Comparative Analyses of Population-scale Phenomic Data in Electronic Medical Records Reveal Race-specific Disease Networks
                                                                                                                                                    Date: Tuesday, July 12 12:20 pm - 12:40 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DISEASE / SYSTEMS
                                                                                                                                                    • Benjamin S. Glicksberg, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Li Li, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Marcus A. Badgeley, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Khader Shameer, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Roman Kosoy, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Noam D. Beckmann, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Nam Pho, Harvard Medical School, United States
                                                                                                                                                    • Joerg Hakenberg, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Meng Ma, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Kristin L. Ayers, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Gabriel E. Hoffman, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Shuyu Dan Li, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Eric E. Schadt, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Chirag J. Patel, Harvard Medical School, United States
                                                                                                                                                    • Rong Chen, Icahn School of Medicine at Mount Sinai, United States
                                                                                                                                                    • Joel T. Dudley, Icahn School of Medicine at Mount Sinai, United States

                                                                                                                                                    Area Session Chair: Yves Moreau

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: Underrepresentation of racial groups represents an important challenge and major gap in phenomics research. Most current human phenomics research is based primarily on European populations; hence it is important to expand it to consider other population groups. One approach is to utilize data from EMR databases that contain patient data from diverse demographics and ancestries. The implications of this racial underrepresentation of data can be profound regarding effects on healthcare delivery and actionability. To the best of our knowledge, our work is the first attempt to perform comparative, population-scale analyses of disease networks across three different populations, namely Caucasian (EA), African American (AA), and Hispanic/Latino (HL).
                                                                                                                                                    Results: We compared susceptibility profiles and temporal connectivity patterns for 1,988 diseases and 37,282 disease pairs represented in a clinical population of 1,025,573 patients. Accordingly, we revealed appreciable differences in disease susceptibility, temporal patterns, network structure, and underlying disease connections between EA, AA, and HL populations. We found 2,158 significantly comorbid diseases for the EA cohort, 3,265 for AA, and 672 for HL. We further outlined key disease pair associations unique to each population as well as categorical enrichments of these pairs. Finally, we identified 51 key “hub” diseases that are the focal points in the race-centric networks and of particular clinical importance. Incorporating race-specific disease co-morbidity patterns will produce a more accurate and complete picture of the disease landscape overall and could support more precise understanding of disease relationships and patient management towards improved clinical outcomes.

                                                                                                                                                    TP097 (PT) - Using genomic annotations increases statistical power to detect eGenes
                                                                                                                                                    Date: Tuesday, July 12 2:00 pm - 2:20 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Dat Duong, UCLA, United States
                                                                                                                                                    • Jennifer Zou, UCLA, United States
                                                                                                                                                    • Farhad Hormozdiari, School of Computing Science, UCLA, United States
                                                                                                                                                    • Jae Hoon Sul, Brigham and Women's Hospital, Boston, USA, United States
                                                                                                                                                    • Jason Ernst, UCLA, United States
                                                                                                                                                    • Buhm Han, Asan Institute for Life Sciences, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea, Korea, Republic of
                                                                                                                                                    • Eleazar Eskin, University of California, Los Angeles, United States

                                                                                                                                                    Area Session Chair: Janet Kelso

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Expression quantitative trait loci (eQTL) are genetic variants
                                                                                                                                                    that affect gene expression. In eQTL studies, one important task
                                                                                                                                                    is to find eGenes, or genes whose expression is associated with at
                                                                                                                                                    least one eQTL. The standard statistical method to determine if a
                                                                                                                                                    gene is an eGene requires association testing at all nearby variants
                                                                                                                                                    and the permutation test to correct for multiple testing. The standard
                                                                                                                                                    method however does not consider genomic annotation of the
                                                                                                                                                    variants. In practice, variants near gene transcription start sites or
                                                                                                                                                    certain histone modifications are likely to regulate gene expression.
                                                                                                                                                    In this paper, we introduce a novel eGene detection method that
                                                                                                                                                    considers this empirical evidence and thereby increases the statistical
                                                                                                                                                    power. We applied our method to the liver Genotype-Tissue Expression
                                                                                                                                                    (GTEx) data using distance from transcription start sites, DNase
                                                                                                                                                    hypersensitivity sites, and six histone modifications as the genomic
                                                                                                                                                    annotations for the variants. Each of these annotations helped us
                                                                                                                                                    detect more candidate eGenes. Distance from transcription start
                                                                                                                                                    site appears to be the most important annotation; specifically, using
                                                                                                                                                    this annotation, our method discovered 50% more candidate eGenes
                                                                                                                                                    than the standard permutation method.
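
The standard procedure described above, association testing at every nearby variant followed by a permutation test, can be sketched as follows. This is a minimal illustration of the standard permutation approach only, not the annotation-aware method introduced in the paper. The choice of absolute Pearson correlation as the association statistic, the function names, and the toy data are all assumptions.

```python
import math
import random


def abs_pearson(x, y):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def max_association(expression, genotypes):
    """Strongest association between expression and any nearby variant."""
    return max(abs_pearson(g, expression) for g in genotypes)


def egene_pvalue(expression, genotypes, n_perm=1000, seed=0):
    """Gene-level p-value: permute expression, keep the best variant each time."""
    rng = random.Random(seed)
    observed = max_association(expression, genotypes)
    shuffled = list(expression)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_association(shuffled, genotypes) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (exceed + 1) / (n_perm + 1)


# Toy data: two variants (genotypes coded 0/1/2) across 12 individuals.
genotypes = [
    [0, 0, 1, 1, 2, 2, 0, 1, 2, 1, 0, 2],
    [1, 0, 2, 0, 1, 2, 2, 0, 1, 1, 0, 2],
]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2, 0.9, 2.0, 2.9, 2.2, 1.1, 3.1]
print(egene_pvalue(expression, genotypes))
```

Taking the maximum over variants inside each permutation is what corrects for testing many variants per gene. In this sketch every variant counts equally, which is the limitation the abstract points to when it notes that the standard method ignores genomic annotation.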

                                                                                                                                                    TP098 (PT) - Simultaneous prediction of enzyme orthologs from chemical transformation patterns for de novo metabolic pathway reconstruction
                                                                                                                                                    Date: Tuesday, July 12 2:00 pm - 2:20 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: SYSTEMS / PROTEINS
                                                                                                                                                    • Yasuo Tabei, Japan Science and Technology Agency, Japan
                                                                                                                                                    • Yoshihiro Yamanishi, Kyushu University, Japan
                                                                                                                                                    • Masaaki Kotera, Tokyo Institute of Technology, Japan

                                                                                                                                                    Area Session Chair: Trey Ideker

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation:
                                                                                                                                                    Metabolic pathways are an important class of molecular networks consisting of compounds, enzymes, and their interactions.
                                                                                                                                                    The understanding of global metabolic pathways is extremely important for various applications in ecology and pharmacology.
                                                                                                                                                    However, large parts of metabolic pathways remain unknown, and most organism-specific pathways contain many missing enzymes.
                                                                                                                                                    Results:
                                                                                                                                                    In this study we propose a novel method to predict the enzyme orthologs that catalyze the putative reactions to facilitate the de novo reconstruction of metabolic pathways from metabolome-scale compound sets.
                                                                                                                                                    The algorithm detects the chemical transformation patterns of substrate-product pairs using chemical graph alignments, and constructs a set of enzyme-specific classifiers to simultaneously predict all the enzyme orthologs that could catalyze the putative reactions of the substrate-product pairs in the joint learning framework.
                                                                                                                                                    The originality of the method lies in its ability to make predictions for thousands of enzyme orthologs simultaneously, as well as its extraction of enzyme-specific chemical transformation patterns of substrate-product pairs.
                                                                                                                                                    We demonstrate the usefulness of the proposed method by applying it to some ten thousands of metabolic compounds,
                                                                                                                                                    and analyze the extracted chemical transformation patterns that provide insights into the characteristics and specificities of enzymes.
                                                                                                                                                    The proposed method will open the door to both primary (central) and secondary metabolism in genomics research,
                                                                                                                                                    increasing research productivity to tackle a wide variety of environmental and public health matters.

                                                                                                                                                    TP099 (PT) - Classifying and Segmenting Microscopy Images with Deep Multiple Instance Learning
                                                                                                                                                    Date: Tuesday, July 12 2:00 pm - 2:20 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DATA
                                                                                                                                                    • Oren Kraus, University of Toronto, Canada
                                                                                                                                                    • Lei Jimmy Ba, University of Toronto, Canada
                                                                                                                                                    • Brendan Frey, University of Toronto, Canada

                                                                                                                                                    Area Session Chair: Curtis Huttenhower

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Abstract
                                                                                                                                                    Motivation: High content screening (HCS) technologies have enabled large scale imaging experiments for studying cell biology and for drug screening. These systems produce hundreds of thousands of microscopy images per day and their utility depends on automated image analysis. Recently, deep learning approaches that learn feature representations directly from pixel intensity values have dominated object recognition challenges. These tasks typically have a single centred object per image and existing models are not directly applicable to microscopy datasets. Here we develop an approach that combines deep convolutional neural networks (CNNs) with multiple instance learning (MIL) in order to classify and segment microscopy images using only whole image level annotations.
                                                                                                                                                    Results: We introduce a new neural network architecture that uses MIL to simultaneously classify and segment microscopy images with populations of cells. We base our approach on the similarity between the aggregation function used in MIL and pooling layers used in CNNs. To facilitate aggregating across large numbers of instances in CNN feature maps we present the Noisy-AND MIL pooling function, a new MIL operator that is robust to outliers. Combining CNNs with MIL enables training CNNs using whole microscopy images with image level labels. We show that training end-to-end MIL CNNs outperforms several previous methods on both mammalian and yeast datasets without requiring any segmentation steps.
                                                                                                                                                    Availability: We will make our implementation and training data available for the final version of the manuscript.
                                                                                                                                                    Contact: oren.kraus@mail.utoronto.ca
                                                                                                                                                    Supplementary information: Supplementary data are available at Bioinformatics online.
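
The abstract does not give the form of the Noisy-AND pooling function. The sketch below uses one sigmoid-based form, assumed here rather than taken from the abstract: the mean instance probability is passed through a sigmoid with slope `a` and threshold `b`, then rescaled so that a mean of 0 maps to 0 and a mean of 1 maps to 1. The parameter values, function names, and example bags are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def noisy_and(instance_probs, a=10.0, b=0.5):
    """Pool per-instance probabilities into one bag-level probability.

    a controls how sharp the transition is; b is the fraction of positive
    instances at which the bag starts to be called positive (both assumed).
    """
    mean_p = sum(instance_probs) / len(instance_probs)
    low = sigmoid(-a * b)           # value of the raw sigmoid at mean_p = 0
    high = sigmoid(a * (1.0 - b))   # value of the raw sigmoid at mean_p = 1
    return (sigmoid(a * (mean_p - b)) - low) / (high - low)


# A bag with a few confident outliers among mostly negative instances.
mostly_negative = [0.05] * 95 + [0.99] * 5
# A bag in which most instances look positive.
mostly_positive = [0.9] * 80 + [0.1] * 20

print(noisy_and(mostly_negative), noisy_and(mostly_positive))
```

Because the output depends on the mean over instances rather than the maximum, a handful of high-scoring instances cannot turn a bag positive on their own. This is one way to read the abstract's claim that the operator is robust to outliers.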

                                                                                                                                                    TP100 (PT) - GeneiASE: Detection of condition-dependent and static allele-specific expression from RNA-seq data without haplotype information
                                                                                                                                                    Date: Tuesday, July 12 2:20 pm - 2:40 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Daniel Edsgärd, KTH Royal Institute of Technology, Sweden
                                                                                                                                                    • Maria Jesus Iglesias, KTH Royal Institute of Technology, Sweden
                                                                                                                                                    • Sarah-Jayne Reilly, Karolinska Institute, Sweden
                                                                                                                                                    • Anders Hamsten, Karolinska Institute, Sweden
                                                                                                                                                    • Per Tornvall, Karolinska Institutet, Sweden
                                                                                                                                                    • Jacob Odeberg, Karolinska Institutet, Sweden
                                                                                                                                                    • Olof Emanuelsson, KTH Royal Institute of Technology, Sweden

                                                                                                                                                    Area Session Chair: Janet Kelso

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Allele-specific expression (ASE) is the imbalance in transcription between maternal and paternal alleles at a locus and can be probed in single individuals using massively parallel DNA sequencing technology. Assessing ASE within a single sample provides a static picture of the ASE, but the magnitude of ASE for a given transcript may vary between different biological conditions in an individual. Such condition-dependent ASE could indicate a genetic variation with a functional role in the phenotypic difference. We developed a method, GeneiASE, to detect genes exhibiting static or condition-dependent ASE in single individuals. GeneiASE performed consistently over a range of read depths and ASE effect sizes, and did not require phasing of variants to estimate haplotypes. We applied GeneiASE on both our own and publicly available data sets, and validated a number of ASE cases using qPCR. GeneiASE is available at https://sourceforge.net/projects/geneiase/.

                                                                                                                                                    TP101 (PT) - Fast metabolite identification with Input Output Kernel Regression
                                                                                                                                                    Date: Tuesday, July 12 2:20 pm - 2:40 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: SYSTEMS / DATA
                                                                                                                                                    • Céline Brouard, Aalto University, Finland
                                                                                                                                                    • Huibin Shen, Aalto University, Finland
                                                                                                                                                    • Kai Dührkop, Friedrich Schiller University Jena, Germany
                                                                                                                                                    • Florence d'Alché-Buc, Télécom ParisTech/Institut Mines-Télécom, France
                                                                                                                                                    • Sebastian Böcker, Friedrich Schiller University Jena, Germany
                                                                                                                                                    • Juho Rousu, Aalto University, Finland

                                                                                                                                                    Area Session Chair: Trey Ideker

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    An important problem in metabolomics is the identification of metabolites from tandem mass spectrometry data. Machine learning methods have recently been proposed to solve this problem by predicting molecular fingerprints and matching these fingerprints against existing databases. In this work, we propose to address the metabolite identification problem using a structured output prediction approach.
                                                                                                                                                    We use the Input Output Kernel Regression method to learn the mapping between tandem mass spectra and molecular structures. The principle of this method is to encode the structure of the input and output spaces with kernels: an operator-valued kernel on the inputs and an output kernel on the outputs. The mapping between the two structured sets is approximated by learning a function with values in the feature space associated with the output kernel and then solving a pre-image problem in the prediction step. We show that our approach achieves state-of-the-art accuracy in metabolite identification. Moreover, our method decreases the running times of the training and test steps by several orders of magnitude compared with preceding methods.
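
The two steps named in the abstract (regression into the output feature space, then a pre-image search) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the operator-valued kernel is taken to be a scalar Gaussian kernel times the identity, the output kernel is a Tanimoto kernel on binary fingerprints, the pre-image search is restricted to a finite candidate list, and all data and function names are hypothetical.

```python
import math


def gaussian_kernel(a, b, gamma=1.0):
    """Scalar input kernel between two feature vectors (stand-ins for spectra)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def tanimoto_kernel(f, g):
    """Output kernel: Tanimoto similarity between two sets of fingerprint bits."""
    union = len(f | g)
    return len(f & g) / union if union else 1.0


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def predict(train_x, train_y, query_x, candidates, lam=0.1):
    """Kernel ridge regression into the output feature space, then pre-image search."""
    n = len(train_x)
    # Regularised input Gram matrix (K_X + lam * I).
    K = [[gaussian_kernel(train_x[i], train_x[j]) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_query = [gaussian_kernel(xi, query_x) for xi in train_x]
    alpha = solve(K, k_query)  # weights of the training outputs for this query

    def score(y):
        # Inner product between the candidate's feature vector and the prediction.
        # With a normalised output kernel, maximising it solves the pre-image problem.
        return sum(a * tanimoto_kernel(yi, y) for a, yi in zip(alpha, train_y))

    return max(candidates, key=score)


# Hypothetical toy data: 2-D "spectra" and fingerprints given as sets of on-bits.
train_x = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
train_y = [frozenset({0, 1}), frozenset({2, 3}), frozenset({0, 1, 2, 3})]
candidates = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]
best = predict(train_x, train_y, (0.9, 0.1), candidates)
```

Here `best` is the candidate whose fingerprint lies closest to the predicted point in the output feature space. Only the linear solve depends on the training set size, and it can be reused across queries.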

                                                                                                                                                    TP102 (PT) - PHOCOS: Inferring Multi-Feature Phenotypic Crosstalk Networks
                                                                                                                                                    Date: Tuesday, July 12 2:20 pm - 2:40 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DATA
                                                                                                                                                    • Yue Deng, School of Pharmacy, UCSF, United States
                                                                                                                                                    • Steven Altschuler, School of Pharmacy, UCSF, United States
                                                                                                                                                    • Lani Wu, School of Pharmacy, UCSF, United States

                                                                                                                                                    Area Session Chair: Curtis Huttenhower

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: Quantification of cellular changes to perturbations can provide a powerful approach to infer crosstalk among molecular components in biological networks. Existing crosstalk inference methods conduct network-structure learning based on a single phenotypic feature (e.g. abundance) of a biomarker. These approaches are insufficient for analyzing perturbation data that can contain information about multiple features (e.g. abundance, activity or localization) of each biomarker.
                                                                                                                                                    Results: We propose a computational framework for inferring phenotypic crosstalk (PHOCOS) that is suitable for high-content microscopy or other modalities that capture multiple phenotypes per biomarker. PHOCOS uses a robust graph-learning paradigm to predict direct effects from potential indirect effects and identify errors due to noise or missing links. The result is a multi-feature, sparse network that parsimoniously captures direct and strong interactions across phenotypic attributes of multiple biomarkers. We use simulated and biological data to demonstrate the ability of PHOCOS to recover multi-attribute crosstalk networks from cellular perturbation assays.

                                                                                                                                                    TP103 (PT) - Data-driven mechanistic analysis method to reveal dynamically evolving regulatory networks
                                                                                                                                                    Date: Tuesday, July 12 2:40 pm - 3:00 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES / SYSTEMS
                                                                                                                                                    • Jukka Intosalmi, Aalto University, Finland
                                                                                                                                                    • Kari Nousiainen, Aalto University, Finland
                                                                                                                                                    • Helena Ahlfors, The Babraham Institute, United Kingdom
                                                                                                                                                    • Harri Lähdesmäki, Aalto University, Finland

                                                                                                                                                    Area Session Chair: Janet Kelso

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Mechanistic models based on ordinary differential equations provide powerful and accurate means to describe the dynamics of the molecular machinery that orchestrates gene regulation. When combined with appropriate statistical techniques, mechanistic models can be calibrated using experimental data and, in many cases, the model structure can also be inferred from time-course measurements. However, existing mechanistic models are limited in the sense that they rely on the assumption of a static network structure and cannot be applied when transient phenomena affect, or rewire, the network structure. In the context of gene regulatory network inference, network rewiring results from the net impact of possibly unobserved transient phenomena, such as changes in signaling pathway activities or the epigenome, which are generally difficult, but important, to account for.

                                                                                                                                                    We introduce a novel method that can be used to infer dynamically evolving regulatory networks from time-course data. Our method is based on the notion that all mechanistic ordinary differential equation models can be coupled with a latent process that approximates the network structure rewiring process. We illustrate the performance of the method using simulated data and, further, we apply the method to study the regulatory interactions during T helper 17 cell differentiation using time-course RNA sequencing data. The computational experiments with the real data show that our method is capable of capturing the experimentally verified rewiring effects of the core Th17 regulatory network. We predict Th17 lineage specific subnetworks that are activated sequentially and control the differentiation process in an overlapping manner.

                                                                                                                                                    TP104 (PT) - Faster and More Accurate Graphical Model Identification of Tandem Mass Spectra using Trellises
                                                                                                                                                    Date: Tuesday, July 12 2:40 pm - 3:00 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: PROTEINS
                                                                                                                                                    • Shengjie Wang, University of Washington, United States
                                                                                                                                                    • John Halloran, University of Washington, United States
                                                                                                                                                    • Jeff Bilmes, University of Washington, United States
                                                                                                                                                    • William Stafford Noble, University of Washington, United States

                                                                                                                                                    Area Session Chair: Trey Ideker

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Tandem mass spectrometry (MS/MS) is the dominant high-throughput technology for identifying and quantifying proteins in complex biological samples. Analysis of the tens of thousands of fragmentation spectra produced by an MS/MS experiment begins by assigning to each observed spectrum the peptide that is hypothesized to be responsible for generating the spectrum. This assignment is typically done by searching each spectrum against a database of peptides. To our knowledge, all existing MS/MS search engines compute scores individually between a given observed spectrum and each possible candidate peptide from the database. In this work, we use a trellis, a data structure capable of jointly representing a large set of candidate peptides, to avoid redundantly recomputing common sub-computations among different candidates. We show how trellises may be used to significantly speed up existing scoring algorithms, and we theoretically quantify the expected speed-up afforded by trellises. Furthermore, we demonstrate that compact trellis representations of whole sets of peptides enable efficient discriminative learning of a dynamic Bayesian network for spectrum identification, leading to greatly improved peptide identification accuracy.
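
The idea of sharing sub-computations among candidate peptides can be illustrated with a simpler relative of a trellis: a prefix cache that computes each distinct prefix mass once. This is a sketch under that simplifying assumption, not the authors' data structure or scoring function; the peptide strings and function names are hypothetical, and the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for a few amino acids.
MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841}


def naive_prefix_masses(peptides):
    """Compute cumulative prefix masses separately for every peptide."""
    additions = 0
    result = {}
    for pep in peptides:
        total, masses = 0.0, []
        for residue in pep:
            total += MASS[residue]
            additions += 1
            masses.append(total)
        result[pep] = masses
    return result, additions


def shared_prefix_masses(peptides):
    """Compute each distinct prefix mass once by caching on the prefix string."""
    cache = {"": 0.0}
    additions = 0
    result = {}
    for pep in peptides:
        masses = []
        for end in range(1, len(pep) + 1):
            prefix = pep[:end]
            if prefix not in cache:
                # The shorter prefix is always cached already, so one addition suffices.
                cache[prefix] = cache[prefix[:-1]] + MASS[pep[end - 1]]
                additions += 1
            masses.append(cache[prefix])
        result[pep] = masses
    return result, additions


# Hypothetical candidate set in which several peptides share a prefix.
peptides = ["GASP", "GASV", "GAPV", "PGAS"]
naive_result, naive_adds = naive_prefix_masses(peptides)
shared_result, shared_adds = shared_prefix_masses(peptides)
```

Both functions return identical masses, but the cached version performs fewer additions whenever candidates share prefixes. A trellis generalises this by also merging paths that share later portions of the computation.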

                                                                                                                                                    TP105 (PT) - CD30 cell graphs of Hodgkin lymphoma are not scale-free—an image analysis approach
                                                                                                                                                    Date: Tuesday, July 12 2:40 pm - 3:00 pm
                                                                                                                                                    Room: Northern Hemisphere E1/E2
                                                                                                                                                    Theme: DATA / DISEASE
                                                                                                                                                    • Hendrik Schäfer, Johann Wolfgang Goethe Universität, Germany
                                                                                                                                                    • Tim Schäfer, Institute of Computer Science, Department of Molecular Bioinformatics, Germany
                                                                                                                                                    • Joerg Ackermann, Johann Wolfgang Goethe Universität, Germany
                                                                                                                                                    • Norbert Dichter, Institute of Computer Science, Department of Molecular Bioinformatics, Germany
                                                                                                                                                    • Claudia Döring, Senckenberg Institute of Pathology, Germany
                                                                                                                                                    • Sylvia Hartmann, Senckenberg Institute of Pathology, Germany
                                                                                                                                                    • Martin-Leo Hansmann, Senckenberg Institute of Pathology, Germany
                                                                                                                                                    • Ina Koch, Johann Wolfgang Goethe University Frankfurt am Main, Germany

                                                                                                                                                    Area Session Chair: Curtis Huttenhower

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    In this talk, we present an investigation from the field of digital pathology. Using whole slide images, we analyzed the cell distribution of CD30 positive cells in Hodgkin lymphoma (HL). HL is a malignancy of the immune system that usually originates from B cells. For diagnosis, biopsies are taken from patients and immunostained. We detected cells in digitized versions of the images using a custom imaging pipeline. The spatial distribution of CD30 cells in the tissue was modeled as a CD30 cell graph. We found that the cell distribution in the tissue is not random. The cells show pronounced clustering in the tissue, which is higher for the lymphoma cases. The vertex degree distributions of the graphs could be modeled by the Gamma distribution, and thus were not scale-free. Our findings are a first step towards modeling the complex spatial interactions of different cell types in the lymph node.
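
As an illustration of what a cell graph and its vertex degree distribution look like, here is a minimal sketch. The abstract does not state how edges are defined, so the distance-threshold rule, the radius, and the cell coordinates below are assumptions made only for this example.

```python
import math
from collections import Counter


def build_cell_graph(cells, radius):
    """Connect every pair of cells whose centres lie within `radius` of each other."""
    neighbours = {i: set() for i in range(len(cells))}
    for i in range(len(cells)):
        for j in range(i + 1, len(cells)):
            if math.dist(cells[i], cells[j]) <= radius:
                neighbours[i].add(j)
                neighbours[j].add(i)
    return neighbours


def degree_distribution(neighbours):
    """Map each vertex degree to the number of vertices with that degree."""
    return Counter(len(adj) for adj in neighbours.values())


def average_clustering(neighbours):
    """Mean local clustering coefficient; vertices with degree < 2 contribute 0."""
    total = 0.0
    for adj in neighbours.values():
        k = len(adj)
        if k < 2:
            continue
        members = sorted(adj)
        # Count edges among the neighbours of this vertex.
        links = sum(
            1
            for idx, a in enumerate(members)
            for b in members[idx + 1:]
            if b in neighbours[a]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(neighbours) if neighbours else 0.0


# Hypothetical cell centres (arbitrary units): a triangle, a pair, and an isolated cell.
cells = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (30, 30)]
graph = build_cell_graph(cells, radius=1.5)
degrees = degree_distribution(graph)
clustering = average_clustering(graph)
```

On real slides, the empirical degree distribution obtained this way could then be compared against candidate models such as a Gamma distribution or a power law.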

                                                                                                                                                    TP106 (PT) - A novel method for discovering local spatial clusters of genomic regions with functional relationships from DNA contact maps
                                                                                                                                                    Date: Tuesday, July 12 3:30 pm - 3:50 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Xihao Hu, The Chinese University of Hong Kong, Hong Kong
                                                                                                                                                    • Christina Huan Shi, The Chinese University of Hong Kong, Hong Kong
                                                                                                                                                    • Kevin Yip, The Chinese University of Hong Kong, Hong Kong

                                                                                                                                                    Area Session Chair: Janet Kelso

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: The three-dimensional structure of genomes makes it possible for genomic regions not adjacent in the primary sequence to be spatially proximal. These DNA contacts have been found to be related to various molecular activities. Previous methods for analyzing DNA contact maps obtained from Hi-C experiments have largely focused on studying individual interactions, forming spatial clusters composed of contiguous blocks of genomic locations, or classifying these clusters into general categories based on some global properties of the contact maps.

                                                                                                                                                    Results: Here we describe a novel computational method that can flexibly identify small clusters of spatially proximal genomic regions based on their local contact patterns. Using simulated data that closely resemble Hi-C data obtained from real genome structures, we demonstrate that our method identifies spatial clusters that are more compact than those found by methods previously used for clustering genomic regions based on DNA contact maps. The clusters identified by our method enable us to confirm functionally related genomic regions previously reported to be spatially proximal in different species. We further show that each genomic region can be assigned a numeric affinity value that indicates its degree of participation in each local cluster, and these affinity values correlate quantitatively with DNase I hypersensitivity, gene expression, super enhancer activities and replication timing in a cell-type-specific manner. We also show that these cluster affinity values can precisely define boundaries of reported topologically associating domains (TADs), and further define local sub-domains within each domain.

                                                                                                                                                    Availability: The source code of BNMF and tutorials on how to use the software to extract local clusters from contact maps are available at http://yiplab.cse.cuhk.edu.hk/bnmf/ .

                                                                                                                                                    TP107 (PT) - BioASF: A Framework for Automatically Generating Executable Pathway Models Specified in BioPAX
                                                                                                                                                    Date: Tuesday, July 12 3:30 pm - 3:50 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: SYSTEMS / DATA
                                                                                                                                                    • Reza Haydarlou, VU University Amsterdam, Netherlands
                                                                                                                                                    • Annika Jacobsen, VU University Amsterdam, Netherlands
                                                                                                                                                    • Nicola Bonzanni, VU University Amsterdam, Netherlands
                                                                                                                                                    • K. Anton Feenstra, VU University Amsterdam, Netherlands
                                                                                                                                                    • Sanne Abeln, VU University, Netherlands
                                                                                                                                                    • Jaap Heringa, VU University Amsterdam, Netherlands

                                                                                                                                                    Area Session Chair: Trey Ideker

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: Biological pathways play a key role in most cellular functions. To better understand these functions, diverse computational and cell biology researchers use biological pathway data for various analysis and modeling purposes. For specifying these biological pathways, a community of researchers has defined BioPAX and provided various tools for creating, validating, and visualizing BioPAX models. However, a generic software framework for simulating BioPAX models is missing. Here, we attempt to fill this gap by introducing a generic simulation framework for BioPAX. The framework explicitly separates the execution model from the model structure as provided by BioPAX, with the advantage that the modelling process becomes more reproducible and intrinsically more modular; this ensures natural biological constraints are satisfied upon execution. The framework is based on the principles of discrete event systems and multi-agent systems, and is capable of automatically generating a hierarchical multi-agent system for a given BioPAX model.

                                                                                                                                                    Results: To demonstrate the applicability of the framework, we simulated two types of biological network models: a gene regulatory network modeling the haematopoietic stem cell regulators and a signal transduction network modeling the Wnt/β-catenin signaling pathway. We observed that the results of the simulations performed using our framework were entirely consistent with the simulation results reported by the researchers who developed the original models in a proprietary language.

                                                                                                                                                    Availability and Implementation: The framework, implemented in Java, is open source and its source code, documentation, and tutorial are available at http://www.ibi.vu.nl/programs/BioASF.

                                                                                                                                                    Contact: j.heringa@vu.nl

                                                                                                                                                    TP108 (PT) - Tracking the Evolution of 3D Gene Organization
                                                                                                                                                    Date: Tuesday, July 12 3:50 pm - 4:10 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES
                                                                                                                                                    • Alon Diament, Tel Aviv University, Israel
                                                                                                                                                    • Tamir Tuller, Tel Aviv University, Israel

                                                                                                                                                    Area Session Chair: Janet Kelso

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    The study of eukaryotic genomic organization has been rapidly advancing in recent years, with next generation sequencing technologies, such as Hi-C, providing large scale measurements of 3D genomic organization at unprecedented resolution. It has recently been shown that the distribution of genes in eukaryotic genomes is not random and that their organization is strongly related to gene expression and function. It has also been shown that some level of conservation of this organization exists between organisms. However, almost all studies of 3D genomic organization analyzed each organism independently from others.

                                                                                                                                                    Here we propose a novel approach for inter-organismal analysis of the evolution of the 3D organization of genes, based on a network representation of Hi-C data from S. cerevisiae and S. pombe. We report global signals of conservation and re-organization of genes in the genome that are correlated with changes in their functionality and expression. Furthermore, we describe algorithms for identifying spatially co-evolving orthologous modules (SCOMs) and demonstrate them for various proposed types of modules, including modules of co-localizing genes with conserved 3D positions, modules of genes that underwent significant changes in their 3D co-localization during evolution, and additional, more complex gene arrangements.

                                                                                                                                                    We show that this approach enables identifying biologically relevant modules of co-evolving genes with shared function. The approach is expected to contribute to the study of genome evolution, gene expression, and even tumorigenesis.
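
As a rough illustration of the network view described in the abstract above, the following minimal sketch thresholds toy contact scores into co-localization edges for two species and then sorts ortholog-mapped gene pairs into "conserved" and "rewired" groups. The gene names, scores, threshold, and function names are all hypothetical, and this is not the authors' SCOM algorithm, only a simplified picture of comparing two contact networks through an ortholog map.

```python
from itertools import combinations

def colocalization_edges(contacts, threshold):
    """Return the set of gene pairs whose contact score meets the threshold."""
    return {frozenset(pair) for pair, score in contacts.items() if score >= threshold}

def classify_ortholog_pairs(edges_a, edges_b, orthologs):
    """Compare species-A gene pairs with their ortholog pairs in species B.

    A pair is 'conserved' if it co-localizes in both species and
    'rewired' if it co-localizes in exactly one of them.
    """
    conserved, rewired = set(), set()
    for g1, g2 in combinations(sorted(orthologs), 2):
        in_a = frozenset((g1, g2)) in edges_a
        in_b = frozenset((orthologs[g1], orthologs[g2])) in edges_b
        if in_a and in_b:
            conserved.add((g1, g2))
        elif in_a != in_b:
            rewired.add((g1, g2))
    return conserved, rewired

# Hypothetical toy data: contact scores between gene pairs in two species.
contacts_a = {("a1", "a2"): 9.0, ("a1", "a3"): 7.5, ("a2", "a3"): 1.0}
contacts_b = {("b1", "b2"): 8.0, ("b1", "b3"): 0.5, ("b2", "b3"): 0.8}
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3"}  # assumed one-to-one mapping

edges_a = colocalization_edges(contacts_a, threshold=5.0)
edges_b = colocalization_edges(contacts_b, threshold=5.0)
conserved, rewired = classify_ortholog_pairs(edges_a, edges_b, orthologs)
print("conserved:", sorted(conserved))
print("rewired:", sorted(rewired))
```

A fixed threshold and a one-to-one ortholog map keep the sketch short; real Hi-C data would need normalization and a statistical test before any pair is called co-localized.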

                                                                                                                                                    TP109 (PT) - PSAMM: A Portable System for the Analysis of Metabolic Models
                                                                                                                                                    Date: Tuesday, July 12 3:50 pm - 4:10 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: SYSTEMS
                                                                                                                                                    • Jon Steffensen, University of Rhode Island, United States
                                                                                                                                                    • Keith Dufault-Thompson, University of Rhode Island, United States
                                                                                                                                                    • Ying Zhang, University of Rhode Island, United States

                                                                                                                                                    Area Session Chair: Trey Ideker

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    The broad application of genome-scale metabolic modeling has made it a useful technique for tackling fundamental questions in biological research and engineering. Today, over 100 models have been constructed for organisms with diverse metabolic activities spanning all three kingdoms of life. These models, however, have been curated independently, following different conventions. Maintaining model consistency has been challenging because of the lack of consensus in model representation and the absence of integrated modeling software for associating mathematical simulations with the annotation and biological interpretation of metabolic models. To solve this problem, we developed a new software package, PSAMM, and a new model format that incorporates heterogeneous, model-specific annotation information into modular representations of model definitions and simulation settings. PSAMM provides significant advances in standardizing the workflow of model annotation and consistency checking. Compared to existing tools, PSAMM supports more flexible configurations and is more efficient in running constraint-based simulations.

                                                                                                                                                    TP110 (PT) - A Low-Latency, Big Database System and Browser for Storage, Querying and Visualization of 3D Genomic Data
                                                                                                                                                    Date: Tuesday, July 12 4:10 pm - 4:30 pm
                                                                                                                                                    Room: Northern Hemisphere A1/A2
                                                                                                                                                    Theme: GENES / DATA
                                                                                                                                                    • Alexander Butyaev, McGill University, Canada
                                                                                                                                                    • Ruslan Mavlyutov, University of Fribourg, Switzerland
                                                                                                                                                    • Mathieu Blanchette, McGill University, Canada
                                                                                                                                                    • Philippe Cudré-Mauroux, University of Fribourg, Switzerland
                                                                                                                                                    • Jérôme Waldispühl, McGill University, Canada

                                                                                                                                                    Area Session Chair: Janet Kelso

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Recent releases of three-dimensional (3D) genome structures have the potential to transform our understanding of genomes. Nonetheless, storage technology and visualization tools need to evolve to offer the scientific community fast and convenient access to these data. We simultaneously introduce a database system to store and query 3D genomic data (3DBG) and a 3D genome browser to visualize and explore 3D genome structures (3DGB). We benchmark 3DBG against state-of-the-art systems and demonstrate that it is faster than previous solutions and, importantly, scales gracefully with the size of the data. We also illustrate the usefulness of our 3D genome Web browser for exploring human genome structures.
                                                                                                                                                    The 3D genome browser is available at http://3dgb.cs.mcgill.ca/.

                                                                                                                                                    TP111 (PT) - Linear effects models of signaling pathways from combinatorial perturbation data
                                                                                                                                                    Date: Tuesday, July 12 4:10 pm - 4:30 pm
                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                    Theme: SYSTEMS
                                                                                                                                                    • Ewa Szczurek, University of Warsaw, Poland
                                                                                                                                                    • Niko Beerenwinkel, ETH Zurich, Switzerland

                                                                                                                                                    Area Session Chair: Trey Ideker

                                                                                                                                                    Presentation Overview:

                                                                                                                                                    Motivation: Perturbations constitute the central means to study signaling pathways. Interrupting components of the pathway and analyzing observed effects of those interruptions can give insight into unknown connections within the signaling pathway itself, as well as the link from the pathway to the effects. Different pathway components may have different individual contributions to the measured perturbation effects, such as gene expression changes. Those effects will be observed in combination when the pathway components are perturbed. Extant approaches focus either on the reconstruction of pathway structure or on resolving how the pathway components control the downstream effects.
                                                                                                                                                    Results: Here, we propose a linear effects model, which can be applied to infer both from combinatorial perturbation data. We use simulated data to demonstrate the accuracy of learning the pathway structure as well as estimation of the individual contributions of pathway components to the perturbation effects. The practical utility of our approach is illustrated by an application to perturbations of the mitogen-activated protein kinase pathway in Saccharomyces cerevisiae.
                                                                                                                                                    Availability: lem is available as an R package at http://www.mimuw.edu.pl/~szczurek/lem
                                                                                                                                                    Contact: niko.beerenwinkel@bsse.ethz.ch
                                                                                                                                                    Supplementary information: Supplementary data are available at Bioinformatics online.
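
The idea in the abstract above, that effects observed under combined perturbations can be decomposed into individual contributions, can be pictured with a deliberately simplified sketch. It fits purely additive per-component contributions by ordinary least squares, using a hypothetical 0/1 design matrix (which components were perturbed in each experiment) and made-up effect values. It does not reproduce the lem package or its inference of pathway structure; all names and numbers are assumptions for illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_additive_effects(design, effects):
    """Least-squares estimate of per-component contributions.

    Solves the normal equations (X^T X) beta = X^T y, where X is the
    design matrix and y the observed effects.
    """
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, effects))
           for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical experiments: perturb A alone, B alone, then A and B together.
design = [[1, 0], [0, 1], [1, 1]]
effects = [2.0, -1.0, 1.0]  # made-up effect of each experiment on one gene

beta = fit_additive_effects(design, effects)
print("estimated contributions (A, B):", [round(v, 6) for v in beta])
```

The combined experiment is what makes the fit informative: with single perturbations alone, each contribution would simply be read off, while the double perturbation tests whether the additive assumption holds.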

                                                                                                                                                    TT01 (PT) -
                                                                                                                                                    Date: Sunday, July 10 6:00 pm - 7:00 pm
                                                                                                                                                    Room: America's Seminar
                                                                                                                                                    Theme:

                                                                                                                                                      Area Session Chair: Rodrigo Lopez

                                                                                                                                                      TT02 (PT) -
                                                                                                                                                      Date: Monday, July 11 10:10 am - 12:40 pm
                                                                                                                                                      Room: America's Seminar
                                                                                                                                                      Theme:

                                                                                                                                                        Area Session Chair: Rodrigo Lopez or Des Higgins

                                                                                                                                                        TT03 (PT) -
                                                                                                                                                        Date: Monday, July 11 2:00 pm - 3:00 pm
                                                                                                                                                        Room: America's Seminar
                                                                                                                                                        Theme:

                                                                                                                                                          Area Session Chair: Rodrigo Lopez or Des Higgins

                                                                                                                                                          TT04 (PT) -
                                                                                                                                                          Date: Monday, July 11 3:30 pm - 4:30 pm
                                                                                                                                                          Room: America's Seminar
                                                                                                                                                          Theme:

                                                                                                                                                            Area Session Chair: Rodrigo Lopez or Des Higgins

                                                                                                                                                            TT05 (PT) -
                                                                                                                                                            Date: Monday, July 11 6:00 pm - 6:20 pm
                                                                                                                                                            Room: Northern Hemisphere A1/A2
                                                                                                                                                            Theme:

                                                                                                                                                              Area Session Chair: Rodrigo Lopez

                                                                                                                                                              TT06 (PT) -
                                                                                                                                                              Date: Monday, July 11 6:00 pm - 6:20 pm
                                                                                                                                                              Room: Northern Hemisphere A3/A4
                                                                                                                                                              Theme:

                                                                                                                                                                Area Session Chair: Des Higgins

                                                                                                                                                                TT07 (PT) -
                                                                                                                                                                Date: Monday, July 11 6:00 pm - 6:20 pm
                                                                                                                                                                Room: America's Seminar
                                                                                                                                                                Theme:

                                                                                                                                                                  Area Session Chair: Dominic Clark

                                                                                                                                                                  TT08 (PT) -
                                                                                                                                                                  Date: Monday, July 11 6:20 pm - 6:40 pm
                                                                                                                                                                  Room: Northern Hemisphere A1/A2
                                                                                                                                                                  Theme:

                                                                                                                                                                    Area Session Chair: Rodrigo Lopez

                                                                                                                                                                    TT09 (PT) -
                                                                                                                                                                    Date: Monday, July 11 6:20 pm - 6:40 pm
                                                                                                                                                                    Room: Northern Hemisphere A3/A4
                                                                                                                                                                    Theme:

                                                                                                                                                                      Area Session Chair: Des Higgins

                                                                                                                                                                      TT10 (PT) -
                                                                                                                                                                      Date: Monday, July 11 6:20 pm - 6:40 pm
                                                                                                                                                                      Room: America's Seminar
                                                                                                                                                                      Theme:

                                                                                                                                                                        Area Session Chair: Dominic Clark

                                                                                                                                                                        TT11 (PT) -
                                                                                                                                                                        Date: Monday, July 11 6:40 pm - 7:00 pm
                                                                                                                                                                        Room: Northern Hemisphere A1/A2
                                                                                                                                                                        Theme:

                                                                                                                                                                          Area Session Chair: Rodrigo Lopez

                                                                                                                                                                          TT12 (PT) -
                                                                                                                                                                          Date: Tuesday, July 12 10:10 am - 10:30 am
                                                                                                                                                                          Room: America's Seminar
                                                                                                                                                                          Theme:

                                                                                                                                                                            Area Session Chair: Des Higgins

                                                                                                                                                                            TT13 (PT) -
                                                                                                                                                                            Date: Tuesday, July 12 10:30 am - 10:50 am
                                                                                                                                                                            Room: America's Seminar
                                                                                                                                                                            Theme:

                                                                                                                                                                              Area Session Chair: Des Higgins

                                                                                                                                                                              TT14 (PT) -
                                                                                                                                                                              Date: Tuesday, July 12 10:50 am - 11:10 am
                                                                                                                                                                              Room: America's Seminar
                                                                                                                                                                              Theme:

                                                                                                                                                                                Area Session Chair: Des Higgins

                                                                                                                                                                                TT15 (PT) -
                                                                                                                                                                                Date: Tuesday, July 12 11:40 am - 12:00 pm
                                                                                                                                                                                Room: America's Seminar
                                                                                                                                                                                Theme:

                                                                                                                                                                                  Area Session Chair: Des Higgins

                                                                                                                                                                                  TT16 (PT) -
                                                                                                                                                                                  Date: Tuesday, July 12 12:00 pm - 12:20 pm
                                                                                                                                                                                  Room: America's Seminar
                                                                                                                                                                                  Theme:

                                                                                                                                                                                    Area Session Chair: Des Higgins

                                                                                                                                                                                    TT17 (PT) -
                                                                                                                                                                                    Date: Tuesday, July 12 12:20 pm - 12:40 pm
                                                                                                                                                                                    Room: America's Seminar
                                                                                                                                                                                    Theme:

                                                                                                                                                                                      Area Session Chair: Des Higgins

TT18 (PT) -
Date: Tuesday, July 12 2:00 pm - 2:20 pm
Room: America's Seminar
Theme:

    Area Session Chair: Dominic Clark

TT19 (PT) -
Date: Tuesday, July 12 2:00 pm - 2:20 pm
Room: Northern Hemisphere E3/E4
Theme:

    Area Session Chair: Rodrigo Lopez

TT20 (PT) -
Date: Tuesday, July 12 2:20 pm - 2:40 pm
Room: America's Seminar
Theme:

    Area Session Chair: Dominic Clark

TT21 (PT) -
Date: Tuesday, July 12 2:20 pm - 2:40 pm
Room: Northern Hemisphere E3/E4
Theme:

    Area Session Chair: Rodrigo Lopez

TT22 (PT) -
Date: Tuesday, July 12 2:40 pm - 3:00 pm
Room: America's Seminar
Theme:

    Area Session Chair: Dominic Clark

TT23 (PT) -
Date: Tuesday, July 12 2:40 pm - 3:00 pm
Room: Northern Hemisphere E3/E4
Theme:

    Area Session Chair: Rodrigo Lopez

TT24 (PT) -
Date: Tuesday, July 12 3:30 pm - 4:30 pm
Room: Northern Hemisphere E1/E2
Theme:

    Area Session Chair: Des Higgins

TT25 (PT) -
Date: Tuesday, July 12 3:30 pm - 3:50 pm
Room: America's Seminar
Theme:

    Area Session Chair: Dominic Clark

TT26 (PT) -
Date: Tuesday, July 12 3:30 pm - 3:50 pm
Room: Northern Hemisphere E3/E4
Theme:

    Area Session Chair: Rodrigo Lopez or Des Higgins

TT27 (PT) -
Date: Tuesday, July 12 3:50 pm - 4:10 pm
Room: America's Seminar
Theme:

    Area Session Chair: Dominic Clark

TT28 (PT) -
Date: Tuesday, July 12 3:50 pm - 4:10 pm
Room: Northern Hemisphere E3/E4
Theme:

    Area Session Chair: Rodrigo Lopez or Des Higgins

TT29 (PT) -
Date: Tuesday, July 12 4:10 pm - 4:30 pm
Room: America's Seminar
Theme:

    Area Session Chair: Dominic Clark

TT30 (PT) -
Date: Tuesday, July 12 4:10 pm - 4:30 pm
Room: Northern Hemisphere E3/E4
Theme:

    Area Session Chair: Rodrigo Lopez or Des Higgins

WK02 Part B (PT) - How to Scale Science and People Using the Cloud.
Date: Monday, July 11 10:10 am - 10:30 am
Room: E3E4
Theme: