ISMB/ECCB 2019 - Special Sessions
- SST01: Text Mining for Biology and Healthcare
- SST02: Scalable Plant-Research in Cloud Environments
- SST03: Social media mining for drug discovery research: challenges and opportunities of Real World Text
- SST04: Computational Oncology – Heterogeneity and Immune Defence
- SST05: Omics Data Formats, Compression and Storage: Present and Future
- SST06: CAID: The Critical Assessment of Intrinsic protein Disorder
- SST07: Reproducibility of findings from big data. From vision to reality
Room: Kairo 1/2 (Ground Floor) Monday, July 22, 10:15 a.m. - 6:00 p.m.
Robert Leaman, National Library of Medicine, National Center for Biotechnology Information, United States
Lars Juhl Jensen, University of Copenhagen, Novo Nordisk Foundation Center for Protein Research, Denmark
Cecilia Arighi, University of Delaware, Delaware Biotechnology Institute, United States
Zhiyong Lu, National Library of Medicine, National Center for Biotechnology Information, United States
Text mining methods for biology and healthcare have matured significantly in recent years. The quality of text mining systems has improved considerably, not only in accuracy but also in interoperability, scalability, and ease of use for non-specialists. Much current research in text mining is published as open source software, making state-of-the-art tools (e.g. PubTator) widely available. Moreover, the use of text mining methods to support other research in the biological and medical sciences has been increasing. Numerous databases use text mining, either to speed up curation (e.g. UniProt) or to integrate evidence directly (e.g. STRING), and literature databases (e.g. PubMed) have a long history of using text mining techniques to improve search capabilities.
The previous BioLINK special interest group (SIG) successfully organized meetings at ISMB and collaborations with other SIGs for many years. Since the BioLINK SIG was discontinued, however, biomedical text mining has advanced significantly. The use of textual genres outside of published literature has greatly expanded, including patents, drug labels, social media and, most notably, clinical records. At the same time, a number of new computational technologies have emerged that have improved accuracy, increased scalability, and expanded the number of applications and use cases (e.g. accelerating drug discovery). Interest in text mining at ISMB has continued: ISMB has consistently published text mining research, even without a specific community of special interest (COSI).
Given the lack of a specific community of special interest (COSI), we propose a special session to be held at ISMB 2019 on Text Mining for Biology and Healthcare, to meet the growing needs and interests of computational biologists in these areas, and to bring together researchers who create text mining tools with researchers who currently use, or are interested in using, text mining tools to make new discoveries. The goal of the session is therefore to link at least two distinct audiences: those who are not text mining specialists but who could use the results in their work (e.g., bioinformaticians and computational biologists), and biomedical text mining specialists who develop new methodologies to advance the state of the art. We therefore propose focusing on text mining use cases (concrete problems with scientific importance) in addition to methodology development.
Room: Shanghai 3/4 (Ground Floor) Wednesday, July 24, 10:15 a.m. - 6:00 p.m.
Dr. Frederik Coppens, VIB, Belgium
This session will give participants an introduction to what is needed to use and create cloud-enabled bioinformatics pipelines. Speakers from several projects that are already using cloud computing to solve plant related research questions will be featured. Based on their hands-on experience, speakers will showcase usage of cloud computing in their projects, including bottlenecks and learned best practices. We will introduce participants to usage of established as well as emerging data repositories and standards. The focus will be on accessing and using these resources for FAIR data management strategies and integrative analysis leveraging the power and scalability of cloud computing, with a particular emphasis on resources created by the ELIXIR Galaxy working group, and by the larger ELIXIR community.
At the end of the session participants will be able to leverage cloud computing and data resources for their research questions according to best practices, using established production platforms.
Plant research needs to cope with the major challenges of population growth and climate change adaptation. Sequencing of the DNA and RNA of crop and forest plants, as well as their pathogens and pests, has generated enormous quantities of data. High-throughput “omics” technologies are widely used and increasingly important to support plant biology research and breeding of diverse plant species for production of food, feed, fibre and other biomaterials, and bio-energy. Much of this data is found in well established repositories and data resources. However, large-scale automated phenotyping is now possible under controlled and field conditions, and there is classical phenotyping data available in literature and in dispersed databases. This data is heterogeneous, described in diverse ways, and difficult to find and re-use.
Significant advances in plant science can be obtained by integrating available genomic and genotyping data with diverse types of phenotyping data, including field and greenhouse experimental data, molecular, -omics and image data. Although most -omics data, and especially phenomic data, are being generated at an increasing scale by public and private research organizations, the dispersion of datasets and metadata across multiple repositories, together with their often poor description and annotation, makes their use and exploitation challenging or even infeasible.
To help unlock the full potential of a multi-omics approach to plant science, it is essential to make plant data interoperable in accordance with the FAIR principles (i.e. Findable, Accessible, Interoperable and Reusable). Several standards have been built for the annotation of data sets.
For phenotyping data, ELIXIR has built an architecture based on the Breeding API (BrAPI, www.brapi.org), an API for accessing data relevant to plant breeding that was developed by the international plant community. Implementing BrAPI endpoints yields a distributed infrastructure for plant phenotyping data that gives access to diverse datasets at different sites. To make these datasets interoperable, the MIAPPE (www.miappe.org) standard for plant phenotypic data has been further developed, and its integration into BrAPI is ongoing. These technologies form the basis for scalable analysis of plant phenotyping data and its integration with data in well-established data archives (www.elixir-europe.org/platforms/data/elixir-deposition-databases) such as
- ArrayExpress (functional data, www.ebi.ac.uk/arrayexpress/),
- PRIDE (proteomics, www.ebi.ac.uk/pride/archive/),
- MetaboLights (metabolomics, www.ebi.ac.uk/metabolights/),
- European Variant Archive (EVA, variant data, www.ebi.ac.uk/eva/) and
- the European Nucleotide Archive (ENA, sequencing data, www.ebi.ac.uk/ena/).
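The distributed access pattern described above, in which each site exposes its own BrAPI endpoints, can be sketched as follows. This is a minimal illustration that makes no network calls: the two sample payloads and the site names are invented, and the envelope layout (`metadata`, `result.data`) and the field names `studyDbId` and `studyName` follow our understanding of BrAPI responses and should be checked against the BrAPI specification before use.

```python
from typing import Any, Dict, List

# Invented sample payloads standing in for responses from two hypothetical sites.
# The envelope layout is an assumption based on the BrAPI response convention.
SITE_A_RESPONSE: Dict[str, Any] = {
    "metadata": {"pagination": {"currentPage": 0, "pageSize": 2,
                                "totalCount": 2, "totalPages": 1}},
    "result": {"data": [
        {"studyDbId": "a-1", "studyName": "Drought trial"},
        {"studyDbId": "a-2", "studyName": "Greenhouse screen"},
    ]},
}
SITE_B_RESPONSE: Dict[str, Any] = {
    "metadata": {"pagination": {"currentPage": 0, "pageSize": 1,
                                "totalCount": 1, "totalPages": 1}},
    "result": {"data": [
        {"studyDbId": "b-1", "studyName": "Field phenotyping"},
    ]},
}


def extract_records(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the list of records from one response, or an empty list."""
    return list(response.get("result", {}).get("data", []))


def merge_sites(responses: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Combine records from several sites, tagging each with its origin."""
    merged: List[Dict[str, Any]] = []
    for site, response in responses.items():
        for record in extract_records(response):
            merged.append({"site": site, **record})
    return merged


if __name__ == "__main__":
    studies = merge_sites({"site-a": SITE_A_RESPONSE, "site-b": SITE_B_RESPONSE})
    for study in studies:
        print(study["site"], study["studyDbId"], study["studyName"])
```

Because every site returns the same envelope, the merging code needs no site-specific logic; in a real setting the payloads would come from HTTP requests to each site's endpoints.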
All this creates a huge demand for compute resources that are easily accessible, scalable, and ideally equipped with a workbench that can handle large datasets and is easily deployable. On this front, cloud computing has gone from cutting edge to standard practice and is no longer solely the domain of computer science professionals. Many cloud providers (both scientific and commercial) exist, and many universities and research institutions run private clouds. The basic functionality of running cloud-native workloads is available on any of them, which avoids vendor lock-in. Users are often not aware that an analysis is cloud-powered.
Projects like Bioconda (bioconda.github.io) and Biocontainers (biocontainers.pro) provide thousands of bioinformatics tools conveniently packaged for use in cloud environments, paving the way for taking bioinformatics data analysis to the cloud. Simultaneously, workflow systems such as Galaxy (galaxyproject.org) and Snakemake (snakemake.readthedocs.io) have been adapted to run in the cloud. As a result, cloud environments are now used in many life science research projects, and given their scalability, reproducibility and reduced costs, it is expected that more and more research projects will be conducted in this way.
Room: Shanghai 1/2 (Ground Floor) Wednesday, July 24, 2:00 p.m. - 6:00 p.m.
Raul Rodriguez-Esteban, Roche, Switzerland
Mathias Leddin, Roche, Switzerland
Juergen Gottowik, Roche, Switzerland
Social media mining in biomedicine is a research area that marries biomedical natural language processing (BioNLP) techniques with the biomedical sciences. This field has been of growing interest since the popularization of microblogs and disease-related forums and blogs. Because it involves handling Real World Data (RWD), it poses unique challenges in comparison to other BioNLP applications such as the mining of scientific documents. These challenges include a higher prevalence of non-English content, use of colloquial and lay language, abundance of noise and junk content, and source format variability. Despite these challenges, social media can bring a unique perspective and knowledge about patients as well as their caregivers, family and friends. In fact, it offers unfiltered access to the patient’s view on a wide range of topics, including politically incorrect and socially embarrassing ones, while bypassing traditional information gatekeepers such as healthcare providers and patient organizations.
In drug discovery the mining of social media has presented hurdles due to regulatory uncertainties with respect to privacy concerns and handling of adverse event reporting. Recently, however, regulatory agencies have been *encouraging* the use of social media to, for example, better understand patient opinions. Moreover, new privacy regulations have been clarifying the legal framework. Thus, social media mining is opening up as a source of knowledge for the drug discovery process and it is being shown to be relevant to many applications in all stages of this process.
Topics of interest that could be covered in this Special Session include, but are not restricted to: patient outcomes research, patient journey, unmet medical need, symptom and daily-life-impact disease models, evaluation of and compliance with current therapies, adverse-event and signal detection, evaluation of existing standards of care, assessing cultural differences between patients, recruitment optimization, patient burden and disease work, and disease-coping strategies.
While not a typical ISMB topic, this session would be relevant to ISMB attendees interested in applications of biomedical text mining to RWD. Furthermore, it would appeal to attendees interested in using computational methods to understand disease phenotypes. We also expect interest from attendees working on or researching the drug discovery process using computational methods.
Room: Shanghai 1/2 (Ground Floor) Thursday, July 25, 8:30 a.m. - 12:40 p.m.
Niko Beerenwinkel, ETH Zurich, Switzerland
Francesca D. Ciccarelli, King’s College London, United Kingdom
Jens Lagergren, Royal Institute of Technology & SciLifeLab, Sweden
Cancer is one of the main causes of morbidity and mortality worldwide. Although chemotherapeutic drugs and new targeted treatments have resulted in improved quality of life and prolonged survival for many patients, most tumors and especially metastases still have a severe impact on human health.
Cancer constitutes a group of diseases characterized by abnormal cell growth, stage-wise progression, genomic and cellular heterogeneity, and potential to develop resistance to therapies. All these aspects are consequences of the evolutionary nature of cancer and, consequently, the study of somatic evolution in cancer constitutes a tremendously promising approach to precision oncology, carrying a huge potential medical impact. This is accentuated by the frequent failure of current biomarker and treatment concepts to achieve durable drug response and long-term survival for cancer patients. Simultaneously, the immune system, in particular antigen-specific T cells, constitutes an essential component of a tumor’s environment and is an important determinant of the selective pressure acting on it during its evolution. Computational methods play an increasingly important role in addressing these challenges and have given rise to the field of computational oncology.
This special session will cover two interrelated and clinically important problems in computational oncology, namely, the assessment and interpretation of intra-tumor molecular heterogeneity and the role of immune responses in cancer treatment. Until recently, the dominant method for studying a tumor’s heterogeneity consisted of first obtaining bulk DNA-seq data and then computationally deconvoluting this signal to identify a few dominant subpopulations. Today, heterogeneity is preferably studied using single-cell data. A growing number of methods, from the ISMB community and beyond, analyze cancer transcriptional and genomic heterogeneity, including the reconstruction of tumor phylogenies, from scRNA-seq and scDNA-seq data, respectively. Moreover, studies of the immune system and cancer-immune interactions have become increasingly interesting, since the field of tumor immunotherapy has proven to be an enormous success in the past decade with multiple therapeutic interventions leading to functional cures in several disparate types of cancer.
Room: Shanghai 3/4 (Ground Floor) Thursday, July 25, 8:30 a.m. - 4:40 p.m.
Mikel Hernaez, Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign
James Bonfield, Wellcome Sanger Institute
In 2003 the first human genome assembly was completed. It was the end of a project that took almost 13 years to complete and cost 3 billion dollars (around $1 per base pair). This milestone ushered in the genomics era, giving rise to personalized or precision medicine. Fortunately, sequencing cost has drastically decreased in recent years. While in 2004 the cost of sequencing a whole human genome was around $20 million, in 2008 it dropped to a million, and in 2017 to a mere $1000. As a result of this decrease in sequencing cost, as well as advancements in sequencing technology, massive amounts of genomic data are being generated. At the current rate of growth (sequencing data is doubling approximately every seven months), more than an exabyte of sequencing data per year will be produced, approaching the zettabyte scale by 2025. As an example, the sequencing data generated by the 1000 Genomes Project (www.1000genomes.org) in the first 6 months exceeded the sequence data accumulated during 21 years in the NCBI GenBank database.
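The figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the genome length of roughly 3 billion base pairs is an assumed round number that the text does not state, the function names are hypothetical, and the doubling time is treated as constant.

```python
import math

# Assumed round figure for the human genome length (not stated in the text).
GENOME_BASE_PAIRS = 3.0e9


def cost_per_base(total_cost_usd: float, base_pairs: float = GENOME_BASE_PAIRS) -> float:
    """Average cost in US dollars per sequenced base pair."""
    return total_cost_usd / base_pairs


def yearly_growth_factor(doubling_months: float) -> float:
    """Factor by which a quantity grows in 12 months for a given doubling time."""
    return 2.0 ** (12.0 / doubling_months)


def months_to_grow(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor` at a constant doubling time."""
    return doubling_months * math.log2(factor)


if __name__ == "__main__":
    print(f"Cost per base pair: ${cost_per_base(3.0e9):.2f}")
    print(f"Growth per year at 7-month doubling: x{yearly_growth_factor(7):.2f}")
    # Going from an exabyte to a zettabyte is a 1000-fold increase.
    print(f"Months for a 1000-fold increase: {months_to_grow(1000, 7):.1f}")
```

Under these assumptions, a seven-month doubling time corresponds to slightly more than a threefold increase per year, and a 1000-fold increase takes a little under six years.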
In addition, the generation of other types of omics data is also experiencing rapid growth. For example, DNA methylation data has been found to be important in early detection of tumors and in determining the prognosis of the disease, and as a result, it has been the subject of many large-scale projects including MethylomeDB and DiseaseMeth, among others. Proteomics and metabolomics studies are also gaining momentum, as they contribute towards a better understanding of the dynamic processes involved in disease, with direct applications in prediction, diagnosis and prognosis, and several repositories have been created, such as PeptideAtlas/PASSEL and PRIDE.
This situation calls for state-of-the-art, efficiently-compressed representations of massive biological datasets, that can not only alleviate the storage requirements but also facilitate the exchange and dissemination of these data. This undertaking is of paramount importance, as the storage and acquisition of the data are becoming the major bottleneck, as evidenced by the recent flourishing of cloud-based solutions enabling processing the data directly on the cloud. For example, companies such as DNAnexus, GenoSpace, Genome Cloud, and Google Genomics, to name a few, offer solutions to perform genome analysis in the cloud.
This sentiment is also reflected by the NIH Big Data to Knowledge (BD2K) initiative launched in 2013, which acknowledged the need to develop innovative and transformative compression schemes to accelerate the integration of big data and data science into biomedical research. This special session will cover current efforts in this area, as well as future challenges. This is of importance to biologists and other researchers who work with omics data, as the developed tools will soon become part of their standard pipelines.
Room: Osaka / Samarkand (3rd Floor) Thursday, July 25, 2:00 p.m. - 4:40 p.m.
Silvio Tosatto, University of Padua, Italy
Zsuzsanna Dosztanyi, Eötvös Loránd University, Hungary
Norman Davey, University College Dublin, Ireland
Damiano Piovesan, University of Padua, Italy
Intrinsically disordered proteins and protein regions (IDPs/IDRs), characterized by high conformational variability, cover almost a third of the residues in eukaryotic proteomes (Perdigão et al., 2015; Mistry et al., 2013). As major players in protein homeostasis (Iakoucheva et al., 2004) and cellular signaling (Iakoucheva et al., 2002), IDPs are involved in numerous diseases. Over the last two decades, IDPs have developed from being bespoke projects of biophysicists interested in protein (non-)folding to being recognized as a major determinant in cellular regulation (Guharoy et al., 2015; Chouard, 2011). One of the key problems with IDPs was the lack of a clear definition of the phenomenon (Dunker et al., 2001; Uversky, 2002; Wright and Dyson, 1999), as different authors have used the term to mean somewhat different things (Orosz and Ovádi, 2011). This affects automatic detection methods for intrinsic disorder. A recent evaluation of these prediction methods on new DisProt examples showed that there is significant room for improvement. A large fraction of IDRs still goes virtually undetected, and many predictors appear to confound intrinsic disorder with regions missing from X-ray structures (Necci et al., 2017).
IDR prediction has been a challenge in CASP, but only for a few editions, due to the difficulty of generating a blind benchmark. With CAID we aim to assess prediction methods for intrinsic disorder by leveraging the manual curation of IDPs in DisProt (Piovesan et al., 2017).
CAID (Critical Assessment of Intrinsic Disorder) is a community-wide experiment to determine and advance the state of the art in the detection of intrinsically disordered residues from the amino acid sequence. Participants are invited to submit new software, which is executed locally by the assessors and evaluated mainly on new experimental disorder evidence from the DisProt database. Each round of DisProt annotation produces a new dataset of IDPs, which is used to assess the performance of prediction methods in a so-called blind test.
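To make the idea of a blind test concrete, the sketch below scores a per-residue disorder prediction against a reference annotation. It is an illustration under stated assumptions, not the CAID protocol: the text does not say which metrics the assessors use, so precision, recall, F1 and the Matthews correlation coefficient are chosen here simply because they are common for binary classification. The encoding (1 = disordered, 0 = ordered), the function names and the toy sequences are hypothetical.

```python
import math
from typing import Dict, Sequence


def confusion_counts(reference: Sequence[int], predicted: Sequence[int]) -> Dict[str, int]:
    """Count per-residue agreement; 1 marks a disordered residue, 0 an ordered one."""
    if len(reference) != len(predicted):
        raise ValueError("reference and prediction must cover the same residues")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for ref, pred in zip(reference, predicted):
        if ref == 1 and pred == 1:
            counts["tp"] += 1
        elif ref == 0 and pred == 1:
            counts["fp"] += 1
        elif ref == 0 and pred == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def score(reference: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compute common binary classification metrics, returning 0.0 when undefined."""
    c = confusion_counts(reference, predicted)
    tp, fp, tn, fn = c["tp"], c["fp"], c["tn"], c["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    # Toy ten-residue example; real evaluations would use curated annotations.
    reference = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
    predicted = [0, 1, 1, 1, 0, 0, 0, 0, 1, 0]
    for name, value in score(reference, predicted).items():
        print(f"{name}: {value:.3f}")
```

A blind test in this sense only requires that the reference annotation was unavailable to the method developers when the predictor was submitted; the scoring itself is ordinary binary classification evaluation.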
CAID is based on previous work from COST Action NGP-net (https://ngp-net.bio.unipd.it/), a community spanning 30 different countries, plus EMBL Heidelberg and EMBL-EBI. Several ELIXIR nodes (e.g. Italy, Hungary, Ireland) have also included IDP-related resources in their national node roadmaps, leading to the recent founding of the international DisProtCentral (http://disprotcentral.org/) umbrella consortium. One key result of NGP-net has been a comprehensive definition of IDPs in DisProt. DisProt is a database of manually curated IDPs, established over a decade ago in the USA (Sickmeier et al., 2007), and recently brought to Europe after years of inactivity and completely re-annotated by NGP-net (Piovesan et al., 2017). Annotation rounds are planned to be carried out each year. In 2018 a new annotation round produced annotations for hundreds of new proteins, which were used as a test set for the first edition of CAID.
The first edition of CAID started with the submission of prediction methods during September 2018 and produced preliminary results presented at CASP 13. Final results will be presented at ISMB during this special session.
Room: Shanghai 1/2 (Ground Floor) Thursday, July 25, 2:00 p.m. - 4:40 p.m.
Stéphanie Boué, Philip Morris Products S.A.
Scientific output is growing at a fast pace (United Nations Educational Scientific and Cultural Organization, 2015), but standards for transparency and reporting of data are often lagging. This lack of transparency can make it difficult to extract and build upon relevant knowledge, and it contributes to the reproducibility issues frequently highlighted in past years (Begley and Ellis, 2012; Begley and Ioannidis, 2015; Button et al., 2013).
Moreover, while knowledge may be extracted both from structured or unstructured text in scientific articles and from data, it is often difficult to keep track of and reconcile all of it. Ideally, all of the existing knowledge should be modeled and used together in a way that enables tracing the origin of the data and the support for the conclusions. The context in which the results have been obtained should be captured alongside the findings. This is particularly relevant when findings in the literature appear contradictory. While insufficient quality of the experiment, i.e., a lack of reproducibility, may explain such discrepancies, it is also possible that a specific mechanism varies depending on the temporal, spatial, or pathological context. Therefore, infrastructure and guidelines that facilitate aggregating well-annotated data and information are needed to enable scientists to harness the power of the data generated globally. Indeed, new and better data sharing and interoperability practices must be put forward and embraced by all stakeholders to realize the full potential of big data in a shared analytical effort by the scientific community (Dinov, 2016; Sansone et al., 2012).
Provided the relevant tools are in place, one can consider how to obtain relevant information to understand our environment and its influence on physiology and pathology. While clinical data is invaluable to reach those objectives, it is often advantageous to first use more accessible and controllable in vitro and in vivo systems to gather preliminary evidence. For example, cell cultures analyzed with high content screening methods can be used to assess the toxicity of compounds and model organisms can be used to test their safety and efficacy. It is essential before drawing definite conclusions to determine the extent to which one can translate specific mechanisms relevant for human health or the ecosystem from the knowledge gained in model systems.
To extract actionable knowledge from big datasets, appropriate analysis tools and statistical methods need to be adopted (Harford, 2014). Ideally, tools developed by diverse groups to analyze data are published so that they can be reused and further developed. Then, once computational methods are developed, it is important that they are benchmarked without bias against existing practices so that the community can make an informed decision on which method to select for specific use cases. This is particularly true for newly developed fields such as metagenomics.
Using the appropriate methods and the wealth of data that is made available empowers (bio)informaticians to train models using artificial intelligence and to make faster and more significant contributions to the advancement of the life sciences, including personalized medicine.