Text Mining COSI

Schedule subject to change
All times listed are in UTC
Wednesday, July 28th
11:00-11:40
Text Mining Keynote
Format: Live-stream

Moderator(s): Robert Leaman

  • Sampo Pyysalo, University of Turku, Finland
11:40-12:00
Systematic tissue annotations of -omics samples by modeling unstructured metadata
Format: Pre-recorded with live Q&A

Moderator(s): Robert Leaman

  • Nathaniel Hawkins, Michigan State University, United States
  • Marc Maldaver, Michigan State University, United States
  • Anna Yannakopoulos, Michigan State University, United States
  • Lindsay Guare, Michigan State University, United States
  • Arjun Krishnan, Michigan State University, United States

Presentation Overview:

There are currently >1 million publicly available human -omics samples but they remain acutely underused because discovering relevant samples is still a significant challenge. This is because sample attributes such as tissue-of-origin are routinely described using non-standard terminologies written in unstructured natural language. Here, we propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for -omics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample text descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms in a structured ontology. Our approach outperforms an advanced annotation method (MetaSRA) that uses graph-based reasoning and a baseline method (TAGGER) that uses exact string matching. We demonstrate that the trained NLP-ML tissue classification models capture tissue ontological relationships and generalize to classifying any text including tissue-specific biological processes and diseases based on their descriptions alone. We also show that NLP-ML models are as accurate as models based on expression profiles in predicting tissue annotations of microarray samples and can naturally classify samples of other experiment types (e.g., RNA-seq, ChIP-seq, and methylation). The NLP-ML prediction code and tissue models are available at https://github.com/krishnanlab/txt2onto.
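As a minimal sketch of the core recipe described above (numerical text representations feeding a supervised multi-label classifier), the following illustrates the idea in Python. The sample descriptions, ontology IDs, and model choices here are illustrative assumptions, not the authors' actual pipeline; see the linked repository for the real code.

```python
# Sketch: free-text sample metadata -> numerical features -> tissue terms.
# Descriptions and UBERON labels below are hypothetical examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

descriptions = [
    "total RNA from adult liver biopsy",
    "primary cortical neurons, day 14 culture",
    "whole blood, healthy donor",
]
labels = [["UBERON:0002107"], ["UBERON:0000955"], ["UBERON:0000178"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)  # binary indicator matrix over ontology terms

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(descriptions)

# One binary classifier per ontology term.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

pred = clf.predict(vectorizer.transform(["frozen liver tissue section"]))
print(mlb.inverse_transform(pred))
```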

12:00-12:10
ThermoScan: A text-mining approach for the identification of thermodynamic data on protein folding from full-text articles
Format: Pre-recorded with live Q&A

Moderator(s): Robert Leaman

  • Paola Turina, University of Bologna, Italy
  • Piero Fariselli, University of Torino, Italy
  • Emidio Capriotti, University of Bologna, Italy

Presentation Overview:

The improvement of machine learning methods for predicting the impact of protein variants at functional and structural levels relies on the availability of large experimental datasets. The process of collecting such data is highly time-consuming and expensive. In particular, the development of methods for predicting the effect of amino acid variants on protein stability needs the extraction of thermodynamic data from a large body of literature. To facilitate this process, we developed ThermoScan, a text-mining approach for identifying the presence of thermodynamic data on protein stability in full-text articles from the PubMed Central (PMC) Open Access subset. The method relies on a regular expression search over three groups of words, from which an empirical score is calculated: (1) the most common conceptual words in experimental studies on protein stability, (2) common thermodynamic parameters, and (3) their units of measure. The method was optimized on a set of publications included in the ProTherm database and tested on a new curated set of articles, manually selected for the presence of thermodynamic data. The results show that ThermoScan returns accurate predictions and outperforms recently developed text-mining algorithms based on the analysis of publication abstracts.
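A hedged sketch of what a ThermoScan-style regular-expression screen over the three word groups might look like follows. The word lists and scoring weights are assumptions chosen for illustration; the published method defines its own lexicons and empirical score.

```python
# Sketch: screen full text for the three word groups and score the passage.
import re

CONCEPT = r"\b(unfolding|denaturation|stability|folding)\b"
PARAMS = r"\b(ΔΔ?G|delta\s*G|Tm|melting\s+temperature)\b"
UNITS = r"\b(kcal/mol|kJ/mol|°C|K)\b"

def thermo_score(text, weights=(1.0, 2.0, 2.0)):
    """Empirical score: weighted match counts per word group (weights
    are illustrative assumptions, not the published values)."""
    counts = [len(re.findall(p, text, flags=re.IGNORECASE))
              for p in (CONCEPT, PARAMS, UNITS)]
    return sum(w * c for w, c in zip(weights, counts))

passage = ("Thermal denaturation gave a Tm of 54.2 °C and a "
           "ΔΔG of unfolding of 1.8 kcal/mol.")
print(thermo_score(passage))  # higher scores flag likely thermodynamic data
```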

12:10-12:20
Quantifying general and discipline-specific word understandability towards better science communication
Format: Pre-recorded with live Q&A

Moderator(s): Robert Leaman

  • Matthew Artuso, Michigan State University, United States
  • Anna Yannakopoulos, Michigan State University, United States
  • Daniel Gomez, Stuyvesant High School, United States
  • Nathaniel Hawkins, Michigan State University, United States
  • Jainil Shah, Okemos High School, United States
  • John Zubek, Michigan State University, United States
  • Arjun Krishnan, Michigan State University, United States

Presentation Overview:

Science literature contains jargon to facilitate efficient communication between experts in an increasingly specialized research space. However, this jargon also makes research inaccessible to non-experts who follow these sciences. To break down this barrier and make research findings broadly accessible, we propose a new measure that quantifies a word's likely understandability. To develop this measure, we used Wikipedia articles and their category tags to generate statistical distributions of individual words across categories. Then, for each word, we calculated the uniformity of its distribution and combined it, using dimensionality reduction, with that word's median frequency to get a single word complexity score. We also repurposed this procedure into a novel method for automatically identifying basic terminology specific to a scientific field. This field-specific complexity score combines, into a single value, how non-uniformly a word is distributed across science overall and how common it is within a specific field. Combining both word lists yields a dictionary of basic English and elementary scientific terminology, with an adaptable method that can be applied to nearly any scientific field without requiring any manual curation of a word list.
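A minimal sketch of the uniformity idea follows, using normalized Shannon entropy over per-category counts; the toy counts are assumptions, and the combination with median frequency via dimensionality reduction is omitted.

```python
# Sketch: how uniformly is a word spread across Wikipedia categories?
import numpy as np

def uniformity(counts):
    """Normalized entropy of a word's counts across categories:
    1.0 = spread evenly (likely everyday word), 0.0 = one category only."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(counts)))

# Hypothetical occurrence counts over five Wikipedia categories.
print(uniformity([40, 38, 41, 39, 42]))  # e.g. "water"  -> close to 1
print(uniformity([0, 0, 95, 3, 2]))      # e.g. "allele" -> close to 0
```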

12:40-13:20
Text Mining Keynote: Text Mining for Drug Safety at FDA: Autonomous Machines vs Human-in-the-Loop
Format: Live-stream

Moderator(s): Cecilia Arighi

  • Lynette Hirschman
  • Robert Ball
13:20-13:40
A machine learning framework for discovering and enriching metagenomics metadata from open access research articles
Format: Pre-recorded with live Q&A

Moderator(s): Cecilia Arighi

  • Maaly Nassar, EMBL-EBI, United Kingdom
  • Robert D Finn, EMBL-EBI, United Kingdom
  • Johanna McEntyre, EMBL-EBI, United Kingdom

Presentation Overview:

Metagenomics is a culture-independent approach for studying the microbes inhabiting a particular environment. Comparing the composition of samples (functionally and/or taxonomically), either within a longitudinal study or between independent studies, can provide clues into how the microbiota have adapted to a particular environment. However, to understand the impact of environmental factors on the microbiome, it is important to also account for experimental confounding factors. Metagenomics databases, such as MGnify, provide analytical services that enable consistent functional and taxonomic annotation, mitigating bioinformatic confounding factors. However, a recurring challenge is that key metadata about the sample (e.g. location, pH) and the molecular methods used to extract and sequence the genetic material are often missing from the sequence records. Nevertheless, this missing metadata may be found in publications describing the research. When identified, the additional metadata can lead to a substantial increase in data reuse and greater confidence in the interpretation of observed biological trends. Here, we describe a machine learning framework that automatically extracts relevant metadata for a wide range of metagenomics studies from the literature contained in Europe PMC. This framework comprises three processes: (1) literature classification and triage, (2) named entity recognition (NER), and (3) database enrichment.
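The three-stage structure can be pictured schematically as below. Every function name and the stub logic are hypothetical placeholders for illustration only; the real framework uses trained classifiers and NER models rather than string checks.

```python
# Schematic sketch of the three-stage framework (placeholders, not EMBL-EBI code).
def classify_and_triage(article_text):
    """Stage 1: flag articles likely to describe metagenomics studies."""
    return "metagenom" in article_text.lower()

def extract_entities(article_text):
    """Stage 2: NER for sample metadata (e.g. location, pH).
    A real system would apply a trained sequence-labelling model here."""
    entities = []
    if "pH" in article_text:
        entities.append(("pH", "sample-property"))
    return entities

def enrich_database(accession, entities):
    """Stage 3: attach recovered metadata to the sequence record."""
    return {"accession": accession, "metadata": dict(entities)}

text = "Soil metagenomics sampling at pH 6.5 in a longitudinal study."
if classify_and_triage(text):
    print(enrich_database("MGYS00001234", extract_entities(text)))
```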

13:40-14:00
Automatic Extraction of Twelve Cardiovascular Concepts from German Discharge Letters using Pre-trained Language Models
Format: Pre-recorded with live Q&A

Moderator(s): Cecilia Arighi

  • Christoph Dieterich, University Hospital Heidelberg, Germany
  • Phillip Richter-Pechanski, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, Germany
  • Christina Kiriakou, Department of Internal Medicine III, University Hospital Heidelberg, Germany
  • Dominic Schwab, Department of Internal Medicine III, University Hospital Heidelberg, Germany
  • Nicolas Geis, Department of Internal Medicine III, University Hospital Heidelberg, Germany

Presentation Overview:

A vast amount of data in clinical routine is stored in unstructured form. To make these data available for research, we need automatic clinical information extraction methods. Recent developments in natural language processing have shown promising results using deep-learning-based pre-trained language models on limited training data.
We evaluated pre-trained language models for recognizing and classifying a set of twelve cardiovascular concepts in German discharge letters. We selected three pre-trained BERT (Bidirectional Encoder Representations from Transformers) models and fine-tuned them on the task of cardiovascular concept extraction, using 204 discharge letters manually annotated by cardiologists at the University Hospital Heidelberg. As a baseline, we compared our results to a well-established statistical machine learning approach based on a conditional random field.
Our best-performing model, a publicly available German pre-trained BERT model, achieved a token-wise micro-averaged F1-score of 86% and outperformed the baseline by 8%.
These results show the applicability of deep learning methods using pre-trained language models to cardiovascular concept extraction on limited training data. This minimizes annotation effort, which is the bottleneck in applying data-driven deep learning in the clinical domain.
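A hedged sketch of this kind of setup follows: a pre-trained German BERT with a token-classification head sized for the twelve concepts. The checkpoint name and label set are assumptions (the talk does not specify which German BERT was used), and the example sentence is fabricated, not clinical data.

```python
# Sketch: German BERT fine-tuning setup for token-level concept extraction.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O"] + [f"B-CONCEPT{i}" for i in range(1, 13)]  # 12 concepts + outside
model_name = "bert-base-german-cased"  # assumed; any German BERT would do

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels)
)

# Training would proceed with a standard fine-tuning loop on annotated letters.
enc = tokenizer("Der Patient zeigt eine Mitralklappeninsuffizienz.",
                return_tensors="pt")
out = model(**enc)
print(out.logits.shape)  # (1, sequence_length, num_labels)
```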

14:20-14:40
Proceedings Presentation: Utilizing Image and Caption Information for Biomedical Document Classification
Format: Pre-recorded with live Q&A

Moderator(s): Cecilia Arighi

  • Pengyuan Li, University of Delaware, United States
  • Xiangying Jiang, University of Delaware, United States
  • Gongbo Zhang, University of Delaware, United States
  • Juan Trelles Trabucco, University of Illinois at Chicago, United States
  • Daniela Raciti, California Institute of Technology, United States
  • Cynthia Smith, The Jackson Laboratory, United States
  • Martin Ringwald, The Jackson Laboratory, United States
  • G. Elisabeta Marai, University of Illinois at Chicago, United States
  • Cecilia Arighi, University of Delaware, United States
  • Hagit Shatkay, University of Delaware, United States

Presentation Overview:

Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step in biocuration is identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual content, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results.
We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions, and titles-and-abstracts in providing complementary information for document classification; these three sources of information, when combined, lead to an overall improved classification performance.
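The first integration strategy can be illustrated with the following sketch: a bag-of-subfigure-classes vector concatenated with a text embedding into one document vector. The class labels, vector dimensions, and the random stand-in embedding are assumptions for illustration, not the paper's actual feature set.

```python
# Sketch: fuse a Figure-word vector with a text embedding (illustrative only).
import numpy as np

subfigure_classes = ["gel", "microscopy", "plot", "diagram"]  # assumed labels

def figure_word_vector(predicted_labels):
    """Bag-of-subfigure-classes representation of a document's figures."""
    v = np.zeros(len(subfigure_classes))
    for lab in predicted_labels:
        v[subfigure_classes.index(lab)] += 1
    return v

fig_vec = figure_word_vector(["microscopy", "microscopy", "plot"])
text_vec = np.random.rand(300)  # stand-in for a caption/abstract embedding

doc_vec = np.concatenate([fig_vec, text_vec])  # single fused representation
print(doc_vec.shape)  # (304,) -> input to any standard document classifier
```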

14:40-15:00
Spotlight on Practical Text Mining Tools: LitSuggest & TeamTat
Format: Live-stream

Moderator(s): Cecilia Arighi

  • Zhiyong Lu, NCBI, NLM, NIH, USA

Presentation Overview:

In recent years, automated text-mining tools have rapidly matured: although not perfect, they are frequently used to support biomedical science, especially data-driven research that requires structured and computable data. Over the years, the NCBI/NLM research team has developed a suite of open-source tools (e.g. PubTator) that are widely used in biomedical text mining. In this talk, I will focus on presenting two recently developed practical systems, LitSuggest and TeamTat, addressing the needs of literature search and annotation, respectively. In a nutshell, LitSuggest (https://www.ncbi.nlm.nih.gov/research/litsuggest/) is a user-friendly system for literature recommendation and curation that uses advanced machine learning techniques for suggesting relevant PubMed articles with high accuracy. For tech-savvy users, TeamTat (https://www.teamtat.org) is a novel tool for managing multi-user, multi-label document annotation, as manually annotated data is key to developing and validating text-mining and information-extraction algorithms. Together, they help bridge the gap between text-mining practitioners, developers, and end users.

15:00-15:20
Spotlight on Practical Text Mining Tools: PubMedBERT
Format: Live-stream

Moderator(s): Cecilia Arighi

  • Hoifung Poon

Presentation Overview:

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this presentation, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained PubMedBERT models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding and Reasoning Benchmark) at https://aka.ms/BLURB.
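For readers who want to try the released models, a minimal sketch of loading PubMedBERT from the Hugging Face hub follows. The checkpoint name is the publicly listed one at the time of writing; treat it as an assumption and consult https://aka.ms/BLURB for the authoritative release.

```python
# Sketch: load the released PubMedBERT checkpoint and encode a sentence.
from transformers import AutoTokenizer, AutoModel

name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

enc = tokenizer("EGFR mutations confer sensitivity to gefitinib.",
                return_tensors="pt")
print(model(**enc).last_hidden_state.shape)  # (1, sequence_length, 768)
```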


