Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Text Mining COSI

Presentations

Schedule subject to change
Monday, July 13th
10:40 AM-11:20 AM
Text Mining Keynote: Collaborative Community Text Mining and Semantic Computing for Biomedical Knowledge Discovery
Format: Live-stream

  • Cathy Wu, University of Delaware, United States

Presentation Overview:

In this talk, Dr. Wu will cover research in integrative literature mining, data mining and semantic computing for knowledge discovery, as well as the underlying cyberinfrastructure for collaborative research. To realize the value of genomic-scale data, her team has developed a semantic computing framework connecting text mining, data mining and biomedical ontologies. Natural language processing and machine learning approaches are employed for information extraction from the literature, along with an automated workflow for large-scale text analytics across documents. The ontological framework allows computational reasoning and, through federated SPARQL queries, connects complex entities and relations such as gene variants, protein post-translational modifications and diseases mined from heterogeneous knowledge sources. Scientific use cases demonstrate data-driven discovery of gene-disease-drug relationships that may facilitate disease understanding and drug target identification for diseases ranging from Alzheimer's to COVID-19.

11:20 AM-11:40 AM
CellMeSH: Probabilistic Cell-Type Identification Using Indexed Literature
Format: Pre-recorded with live Q&A

  • Shunfu Mao, University of Washington, United States
  • Yue Zhang, Paul G. Allen School of Computer Science and Engineering, University of Washington, United States
  • Georg Seelig, ECE & CSE, University of Washington, United States
  • Sreeram Kannan, University of Washington, United States

Presentation Overview:

Single-cell RNA sequencing (scRNA-Seq) is becoming widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-Seq experiments aim to identify and quantify all cell-types present in a sample. Measured single-cell transcriptomes are grouped by similarity, and the resulting clusters are then mapped to cell-types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious effort that requires expert knowledge. We introduce CellMeSH, a new approach that identifies cell-types from gene-set queries directly against the literature. CellMeSH combines a database of gene-cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and scales automatically. The probabilistic query method enables reliable information retrieval even though the gene-cell-type associations extracted from the literature are necessarily noisy. CellMeSH achieves 60% top-1 accuracy and 90% top-3 accuracy in annotating cell-types on a human dataset, and up to 58.8% top-1 accuracy and 88.2% top-3 accuracy on three mouse datasets, consistently better than existing approaches.
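To illustrate the kind of probabilistic database querying the abstract describes, here is a minimal sketch, not the actual CellMeSH model: the gene-cell-type counts, marker names, and the naive smoothed-likelihood scoring are all invented for illustration.

```python
import math

# Hypothetical gene -> cell-type association counts mined from literature.
DB = {
    "B cell": {"CD19": 120, "MS4A1": 95, "CD3E": 4},
    "T cell": {"CD3E": 200, "CD19": 6, "IL7R": 80},
}

def score(cell_type, query_genes, alpha=1.0):
    """Smoothed log-likelihood of a query gene set for one cell type."""
    counts = DB[cell_type]
    total = sum(counts.values()) + alpha * len(query_genes)
    return sum(
        math.log((counts.get(g, 0) + alpha) / total) for g in query_genes
    )

def rank(query_genes):
    """Rank cell types by descending query likelihood."""
    return sorted(DB, key=lambda ct: score(ct, query_genes), reverse=True)

print(rank(["CD19", "MS4A1"]))  # B-cell markers rank "B cell" first
```

The add-one smoothing is what makes the query robust to noisy or missing associations: a gene absent from a cell type's counts lowers the score without zeroing it out.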

12:00 PM-12:10 PM
BERTMeSH: Deep Contextual Representation Learning for Large-scale High-performance MeSH Indexing with Full Text
Format: Pre-recorded with live Q&A

  • Hiroshi Mamitsuka, Kyoto University / Aalto University, Japan
  • Shanfeng Zhu, Fudan University, China
  • Ronghui You, Fudan University, China
  • Eddatt Liu, Fudan University, China

Presentation Overview:

Large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only previous method for large-scale MeSH indexing with full text, suffers from three major drawbacks: it 1) uses Learning To Rank (LTR), which is time-consuming, 2) can capture only certain pre-defined sections of full text, and 3) ignores the whole of MEDLINE. We propose a deep learning based method, BERTMeSH, which is flexible with respect to section organization in full text. BERTMeSH combines two technologies: 1) the state-of-the-art pre-trained deep contextual representation, BERT (Bidirectional Encoder Representations from Transformers), which allows BERTMeSH to capture the deep semantics of full text; and 2) a transfer learning strategy that uses both full text in PubMed Central (PMC) and titles and abstracts in MEDLINE, to take advantage of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on approximately 1.5 million full-text articles from PMC. BERTMeSH outperformed various cutting-edge baselines. For example, on 20K test articles from PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH, with the difference being statistically significant. Also, predicting the 20K test articles took 5 minutes with BERTMeSH, compared with more than 10 hours for FullMeSH, demonstrating the computational efficiency of BERTMeSH.
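The Micro F-measure reported above can be computed as sketched below. The thresholding of per-label probabilities is a common multi-label readout assumed here for illustration, not necessarily BERTMeSH's exact decision rule, and the MeSH IDs are arbitrary examples.

```python
def micro_f1(gold, pred):
    """Micro-averaged F-measure over documents.

    gold/pred: lists of sets of MeSH term IDs, one set per document."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly assigned terms
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious terms
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed terms
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def assign(prob_per_label, threshold=0.5):
    """Threshold per-label probabilities (e.g. sigmoid outputs) into a MeSH set."""
    return {label for label, p in prob_per_label.items() if p >= threshold}

gold = [{"D003920", "D009369"}, {"D003920"}]
pred = [assign({"D003920": 0.9, "D009369": 0.3, "D001943": 0.7}),
        assign({"D003920": 0.8})]
print(round(micro_f1(gold, pred), 3))  # 0.667
```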

12:10 PM-12:20 PM
Integrating Image Caption Information into Biomedical Document Classification in Support of Biocuration
Format: Pre-recorded with live Q&A

  • Xiangying Jiang, University of Delaware, United States
  • Pengyuan Li, University of Delaware, United States
  • James Kadin, The Jackson Laboratory, United States
  • Judith Blake, The Jackson Laboratory, United States
  • Martin Ringwald, The Jackson Laboratory, United States
  • Hagit Shatkay, University of Delaware, United States

Presentation Overview:

Biological databases provide precise, well-organized information for supporting biomedical research. Developing such databases typically relies on manual curation. However, the large and rapidly increasing publication rate makes it impractical for curators to quickly identify all and only those documents of interest. As such, automated biomedical document classification has attracted much attention.

Images convey significant information for supporting biomedical document classification. Accordingly, using image captions, which provide a simple and accessible summary of the actual image content, has the potential to enhance classifier performance.

We present a classification scheme incorporating features gathered from captions in addition to titles-and-abstracts. We trained and tested our classifier on a large imbalanced dataset, originally curated by the Gene Expression Database (GXD), consisting of ~60,000 documents, where each document is represented by the text of its title, abstract and captions.

Our classifier attains precision of 0.698, recall of 0.784, F-measure of 0.738 and a Matthews correlation coefficient of 0.711, exceeding the performance of current classifiers employed by biological databases and showing that our framework effectively addresses the imbalanced classification task faced by GXD. Moreover, our classifier's performance is significantly improved by utilizing information from captions compared to using titles-and-abstracts alone, demonstrating that captions provide substantial information for supporting biomedical document classification.
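The Matthews correlation coefficient reported above is a standard summary metric for imbalanced binary classification. A minimal sketch of its computation from confusion-matrix counts follows; the counts below are illustrative, not GXD data.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Perfect predictions give MCC = 1.0.
print(mcc(50, 0, 0, 50))    # 1.0
# Chance-level predictions give MCC = 0.
print(mcc(25, 25, 25, 25))  # 0.0
```

Unlike accuracy, MCC stays near zero for a classifier that merely predicts the majority class, which is why it is informative on imbalanced curation datasets.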

12:20 PM-12:30 PM
Bridging the gap: enlisting authors’ help addressing the remaining difficulties in automated concept extraction
Format: Pre-recorded with live Q&A

  • Zhiyong Lu, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, United States
  • Robert Leaman, NCBI/NLM/NIH, United States
  • Chih-Hsuan Wei, National Center of Biotechnology Information (NCBI), United States
  • Alexis Allot, NCBI/NLM/NIH, United States

Presentation Overview:

Automated text analysis has proven very effective for retrieving relevant articles from the biomedical literature. But quantitative biomedical research requires the specific knowledge embedded within individual articles and more comprehensive results across the literature. Text mining helps provide this data by automating the conversion of unstructured text such as scientific publications into structured, computable formats.

Recent advances allow automated text mining systems to provide results that are predominantly of very high quality. However, some cases remain difficult for current text mining tools. In this work we propose ten writing tips to enlist the help of authors. We also propose a web-based tool, PubReCheck, to help authors visualize results of automated concept extraction on their text and to automatically identify many types of issues prior to publication.

Articles that follow these guidelines will typically be processed more accurately, enabling their content to be found more readily and used more widely. Following these guidelines at a large scale will improve the ability of millions of readers to find articles that meet their information needs, and the ability of text mining tools to provide structured, computable data that enable larger-scale and more rapid analyses.

Availability: https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/PubReCheck/

12:30 PM-12:40 PM
Promoting Reproducible Research for Characterizing Nonmedical use of Medications through Data Annotation: A Description of a Twitter Corpus and Guidelines
Format: Pre-recorded with live Q&A

  • Karen Oconnor, University of Pennsylvania, United States
  • Graciela Gonzalez-Hernandez, University of Pennsylvania, United States
  • Abeed Sarker, Emory University, United States
  • Jeanmarie Perrone, University of Pennsylvania, United States

Presentation Overview:

Due to the contribution of prescription medications, such as stimulants or opioids, to the broader drug abuse crisis, several studies have explored the use of social media data, such as from Twitter, for monitoring prescription medication abuse. However, there is a lack of clear descriptions of how abuse information is presented on public social media, and few thorough annotation guidelines or annotated datasets that could serve as the groundwork for long-term research on automatically characterizing social media chatter associated with abuse-prone medications.
We employed an iterative annotation strategy to create an annotated corpus of 16,433 tweets mentioning one of 20 abuse-prone medications, labeled as one of four classes: potential abuse or misuse, non-abuse consumption, drug mention only, or unrelated. We experimented with several machine learning algorithms to illustrate the utility of the corpus and to generate baseline performance metrics for automatic classification on these data. Among the machine learning classifiers, support vector machines obtained the highest automatic classification accuracy of 73.00% (95% CI 71.4-74.5) on the test set (n=3271).
We expect that our annotation strategy, guidelines, and dataset will provide a significant boost to community-driven, data-centric approaches for monitoring prescription medication misuse or abuse on Twitter.
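The reported 95% CI is consistent with a standard normal-approximation (Wald) interval for a proportion; a quick check, assuming that is the interval the authors used:

```python
import math

def wald_ci(p, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Accuracy 73.00% on n=3271 test tweets, as reported in the abstract.
lo, hi = wald_ci(0.7300, 3271)
print(f"{lo:.3f}-{hi:.3f}")  # 0.715-0.745
```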

2:00 PM-2:40 PM
Text Mining Keynote: Text mining to understand drug action: from PubMed to Reddit
Format: Live-stream

  • Russ Altman, Stanford University, United States
2:40 PM-3:00 PM
Comprehensive Named Entity Recognition on CORD-19 with Distant or Weak Supervision
Format: Pre-recorded with live Q&A

  • Xuan Wang, University of Illinois at Urbana-Champaign, United States
  • Yingjun Guan, University of Illinois at Urbana-Champaign, United States
  • Jiawei Han, University of Illinois at Urbana-Champaign, United States
  • Xiangchen Song, University of Illinois at Urbana-Champaign, United States
  • Bangzheng Li, University of Illinois at Urbana-Champaign, United States

Presentation Overview:

Motivation: To facilitate biomedical text mining research on COVID-19, we have developed CORD-NER, an automated, comprehensive named entity annotation and typing system. The system generates an annotated CORD-NER dataset based on the COVID-19 Open Research Dataset, covering 75 fine-grained entity types with high quality. Both the system and the annotated literature datasets will be updated regularly.
Results: The distinctive features of the CORD-NER dataset include: (1) it covers 75 fine-grained entity types: in addition to the common biomedical entity types (e.g., genes, chemicals and diseases), it covers many new entity types specifically related to COVID-19 studies (e.g., coronaviruses, viral proteins, evolution, materials, substrates and immune responses), which may benefit research on the COVID-19 virus, its spreading mechanisms, and potential vaccines; (2) it relies on distantly- and weakly-supervised NER methods, with no need for expensive human annotation on any articles or sub-corpus; and (3) its entity annotation quality surpasses SciSpacy (over 10% higher F1 score on a sample set of documents), a fully supervised BioNER tool. Our NER system supports incrementally adding new documents, as well as adding new entity types when needed by supplying dozens of seed examples as input.
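A minimal sketch of distantly supervised NER labeling of the general kind described above (not the CORD-NER pipeline itself): seed dictionaries produce weak BIO tags by string matching, with no human annotation. The seed terms and types here are illustrative.

```python
# Hypothetical seed dictionaries; real systems use far larger type vocabularies.
SEEDS = {
    "CORONAVIRUS": {"sars-cov-2", "mers-cov"},
    "DISEASE": {"covid-19", "pneumonia"},
}

def weak_label(tokens):
    """Assign BIO tags by longest-match lookup in the seed dictionaries."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try spans up to 3 tokens long, longest first.
        for j in range(min(len(tokens), i + 3), i, -1):
            span = " ".join(tokens[i:j]).lower()
            for etype, names in SEEDS.items():
                if span in names:
                    tags[i] = "B-" + etype
                    for k in range(i + 1, j):
                        tags[k] = "I-" + etype
                    i = j
                    matched = True
                    break
            if matched:
                break
        if not matched:
            i += 1
    return tags

print(weak_label(["SARS-CoV-2", "causes", "COVID-19"]))
```

Tags produced this way are noisy, which is why such pipelines typically train a neural tagger on the weak labels rather than using the matches directly.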

3:30 PM-3:40 PM
COVID-19 systems demo panel (part 2)
Format: Pre-recorded with live Q&A

  • Xuan Wang, University of Illinois at Urbana-Champaign, United States
  • Jiawei Han, University of Illinois at Urbana-Champaign, United States
3:40 PM-3:50 PM
COVID-19 systems demo panel (part 3)
Format: Pre-recorded with live Q&A

  • Tonia Korves, MITRE, United States
3:50 PM-4:00 PM
COVID-19 systems demo panel (part 4)
Format: Pre-recorded with live Q&A

  • Fabio Rinaldi, University of Zurich, Switzerland
4:00 PM-4:10 PM
COVID-19 systems demo panel (part 5)
Format: Pre-recorded with live Q&A

  • John Bachman, Laboratory of Systems Pharmacology, Harvard Medical School, United States
4:10 PM-4:40 PM
COVID-19 systems demo panel (part 6)
Format: Live-stream

  • Fabio Rinaldi, University of Zurich, Switzerland
  • Amalie Trewartha, Lawrence Berkeley National Laboratory, United States
  • Xuan Wang, University of Illinois at Urbana-Champaign, United States
  • Jiawei Han, University of Illinois at Urbana-Champaign, United States
  • Tonia Korves, MITRE, United States
  • John Bachman, Laboratory of Systems Pharmacology, Harvard Medical School, United States
5:00 PM-5:20 PM
Proceedings Presentation: PEDL: Extracting protein-protein associations using deep language models and distant supervision.
Format: Pre-recorded with live Q&A

  • Leon Weber, Humboldt-Universität zu Berlin, Germany
  • Kirsten Thobe, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Germany
  • Oscar Arturo Migueles Lozano, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Germany
  • Jana Wolf, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Germany
  • Ulf Leser, Humboldt-Universität zu Berlin, Germany

Presentation Overview:

Motivation: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein-protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on scarce, manually labelled data, which severely limits their performance.

Results: We propose PEDL, a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods relying solely on manually labelled annotations. We introduce three different data sets for PPA prediction and evaluate PEDL on the two subtasks of predicting PPAs between two proteins and identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better on both tasks across all three data sets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA.
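A minimal sketch of the distant-supervision idea (not PEDL's actual pipeline): sentences co-mentioning a protein pair that a pathway database lists as interacting become noisy positive training examples. The pairs and the sentence below are invented.

```python
# Hypothetical pathway-database pairs; PEDL draws on curated resources at scale.
KNOWN_PPAS = {("MAPK1", "JUN"), ("TP53", "MDM2")}

def distant_label(sentence_tokens, protein_pair):
    """Distant supervision: a sentence co-mentioning a known-interacting
    pair is taken as a (noisy) positive training example."""
    a, b = protein_pair
    co_mention = a in sentence_tokens and b in sentence_tokens
    pair_known = (a, b) in KNOWN_PPAS or (b, a) in KNOWN_PPAS
    return co_mention and pair_known

sent = ["MAPK1", "phosphorylates", "JUN", "in", "vivo"]
print(distant_label(sent, ("MAPK1", "JUN")))   # True
print(distant_label(sent, ("MAPK1", "TP53")))  # False
```

The labels are noisy by construction (a co-mention need not actually state the association), which is why the abstract also evaluates span identification to check which sentences truly support a PPA.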

Availability: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used data sets and to reproduce the experiments from this paper.

Contact: leser@informatik.hu-berlin.de or jana.wolf@mdc-berlin.de

Supplementary information: Supplementary data are available at Bioinformatics online.

5:20 PM-5:40 PM
Deep Semi-supervised Ensemble Method for Classifying Co-mentions of Human Proteins and Phenotypes
Format: Pre-recorded with live Q&A

  • Morteza Pourreza Shahri, Montana State University, United States
  • Indika Kahanda, Montana State University, United States

Presentation Overview:

Identifying human protein-phenotype relations is of paramount importance for uncovering rare and complex diseases. The Human Phenotype Ontology (HPO) is a standardized vocabulary for describing disease-related phenotypic abnormalities in humans. Since the experimental determination of HPO categories for human proteins is a highly resource-consuming task, developing automated tools that can accurately predict HPO categories has recently gained interest. In this work, we develop a novel method for classifying sentence-level co-mentions of human proteins and HPO names using semi-supervised learning and deep neural networks. Our model is a combination of BERT, CNN, and RNN models with a self-learning module, and it uses a large collection of unlabeled co-mentions available from ProPheno, an online database developed in a previous study. Using a gold-standard dataset of curated sentence-level co-mentions, we demonstrate that our proposed model achieves state-of-the-art performance in classifying human protein-phenotype co-mentions, outperforming two supervised and semi-supervised support vector machine counterparts. The findings and insights of this work have implications for biocurators, researchers, and biomedical text mining tool developers.
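A minimal sketch of the self-learning (self-training) idea behind the module described above, with a toy nearest-centroid classifier standing in for the BERT/CNN/RNN ensemble; all points and the margin threshold are invented.

```python
def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def dist2(p, q):
    """Squared Euclidean distance."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def self_train(labeled, unlabeled, rounds=3, margin=10.0):
    """labeled: {class: [points]}. Each round, pseudo-label unlabeled points
    whose distance gap between the two class centroids exceeds `margin`."""
    pool = list(unlabeled)
    for _ in range(rounds):
        cents = {c: centroid(pts) for c, pts in labeled.items()}
        remaining = []
        for p in pool:
            d = sorted((dist2(p, c), cls) for cls, c in cents.items())
            if d[1][0] - d[0][0] > margin:  # confident: large margin
                labeled[d[0][1]].append(p)  # add as pseudo-labeled example
            else:
                remaining.append(p)         # ambiguous: leave unlabeled
        pool = remaining
    return labeled

labeled = {"pos": [(0.0, 0.0), (1.0, 0.0)], "neg": [(10.0, 0.0), (11.0, 0.0)]}
unlabeled = [(1.5, 0.5), (9.5, -0.5), (5.2, 0.0)]
result = self_train(labeled, unlabeled)
print(len(result["pos"]), len(result["neg"]))  # 3 3
```

The confidence margin is what keeps self-training from amplifying its own errors: the ambiguous midpoint (5.2, 0.0) is never pseudo-labeled.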

5:40 PM-6:00 PM
A Hybrid Method for Phenotype Concept Recognition using the Human Phenotype Ontology
Format: Pre-recorded with live Q&A

  • Zhiyong Lu, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, United States
  • Ling Luo, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, United States
  • Shankai Yan, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, United States
  • Po-Ting Lai, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, United States
  • Daniel Veltri, National Institute of Allergy and Infectious Diseases, United States
  • Andrew Oler, National Institute of Allergy and Infectious Diseases, United States
  • Sandhya Xirasagar, National Institute of Allergy and Infectious Diseases, United States
  • Rajarshi Ghosh, National Institute of Allergy and Infectious Diseases, United States
  • Morgan Similuk, National Institute of Allergy and Infectious Diseases, United States
  • Peter Robinson, The Jackson Laboratory for Genomic Medicine, United States

Presentation Overview:

Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research, despite a few previous attempts. In prior work, both dictionary-based and machine learning-based methods have been attempted for phenotype concept recognition. Dictionary-based methods can achieve high precision but suffer from lower recall. Machine learning-based methods can recognize more phenotype concept variants through automatic feature learning; however, most require large corpora of manually annotated data for model training, which are difficult to obtain due to the high cost of human annotation. To address these problems, we propose a hybrid method combining dictionary-based and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary. We then employ the dictionary-based method and HPO to automatically build a "weak" training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase into a corresponding concept ID. Finally, the machine learning-based predictions are incorporated into the dictionary-based results for improved performance. Experimental results show that our method achieves state-of-the-art performance without any manually labeled training data.
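A minimal sketch of the hybrid dictionary-plus-ML combination described above (not the authors' implementation): high-precision dictionary matches take precedence, and non-overlapping model predictions are merged in to recover recall. The HPO entries and the model output below are illustrative.

```python
# Hypothetical HPO dictionary (name/synonym -> concept ID); the real
# ontology has many thousands of terms and synonyms.
HPO = {
    "seizure": "HP:0001250",
    "short stature": "HP:0004322",
}

def dict_match(text):
    """Exact dictionary matches as (start, end, concept_id) spans."""
    spans, low = [], text.lower()
    for name, cid in HPO.items():
        start = low.find(name)
        if start != -1:
            spans.append((start, start + len(name), cid))
    return spans

def merge(dict_spans, ml_spans):
    """Keep dictionary spans; add ML spans that do not overlap them."""
    merged = list(dict_spans)
    for s, e, cid in ml_spans:
        if all(e <= ds or s >= de for ds, de, _ in dict_spans):
            merged.append((s, e, cid))
    return sorted(merged)

text = "Patient presented with seizure and febrile episodes."
ml = [(35, 42, "HP:0001945")]  # hypothetical model prediction for "febrile"
print(merge(dict_match(text), ml))
```

Giving the dictionary precedence preserves its high precision, while the model contributes the variant phrasings the dictionary misses.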