Text Mining COSI

Schedule subject to change
All times listed are in CDT
Monday, July 11th
10:30-10:40
Text Mining COSI Welcome
Room: MQRN
Format: Live from venue

Moderator(s): Zhiyong Lu

  • COSI Chairs
10:40-11:30
Keynote Presentation: Exploring semantic and genetic disease spaces
Room: MQRN
Format: Live-stream

Moderator(s): Zhiyong Lu

  • Andrey Rzhetsky
11:30-11:50
Improving dictionary-based named entity recognition with deep learning
Room: MQRN
Format: Live from venue

Moderator(s): Zhiyong Lu

  • Katerina Nastou, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Sampo Pyysalo, TurkuNLP group, Department of Computing, Faculty of Technology, University of Turku, Finland
  • Lars Juhl Jensen, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark


Presentation Overview:

Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adapting it to a new entity type requires a new high-quality dictionary and an associated list of blocked names. Such blocklists are typically created by manually inspecting individual names to identify those that cause many false positives, which scales poorly.
Our aim in this work is to improve blocklists by automatically identifying names to be blocked, based on the context in which they appear. By comparing the results of three well-established biomedical NER methods, we identified millions of text spans where the methods agree. These were used to generate positive and negative examples of contexts, which were then used to train BioBERT to perform entity type classification. Applying the best model allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes improved the text mining results in many biological databases of associations between biomedical entities, such as STRING.
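
To illustrate how a corpus-wide and a document-specific blocklist interact with dictionary matching, here is a minimal sketch in Python. It is not the presenters' tagger: the function name `tag_entities`, the toy dictionary, and the placeholder identifiers are all assumptions made for illustration, and the BioBERT step that would produce the blocklists is not shown.

```python
import re

def tag_entities(text, dictionary, blocked=frozenset(), doc_blocked=frozenset()):
    """Return sorted (start, end, matched_text, identifier) tuples.

    `blocked` holds lower-cased names suppressed in every document;
    `doc_blocked` holds lower-cased names suppressed in this document only.
    """
    matches = []
    for name, identifier in dictionary.items():
        key = name.lower()
        if key in blocked or key in doc_blocked:
            continue  # name is known to cause false positives
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), m.group(0), identifier))
    return sorted(matches)

# Hypothetical dictionary: names mapped to placeholder identifiers.
dictionary = {"TP53": "ID:0001", "CAT": "ID:0002"}
text = "The cat sat near the TP53 sample."

unfiltered = tag_entities(text, dictionary)
filtered = tag_entities(text, dictionary, blocked=frozenset({"cat"}))
```

Without the blocklist, the everyday word "cat" is matched as the gene name CAT. With it, only TP53 remains. A document-specific list works the same way but is passed per document through `doc_blocked`.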

11:50-12:10
Cosine similarity preserving compression of dense embedding vectors
Room: MQRN
Format: Live from venue

Moderator(s): Zhiyong Lu

  • Witold Wydmański, Małopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  • Pariya Mehrbod, Bioinformatics Research - Institute of Molecular Biotechnology, Boku University Vienna, Austria
  • Peng Li, Massachusetts General Hospital, Harvard Medical School, Cambridge, United States
  • Minjun Chen, NCTR, US FDA, Jefferson, United States
  • Aleksandra Gruca, Silesian University of Technology, Gliwice, Poland
  • Wenzhong Xiao, Massachusetts General Hospital, Harvard Medical School, Cambridge, United States
  • Paweł P. Łabaj, Małopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  • David P. Kreil, Bioinformatics Research - Institute of Molecular Biotechnology, Boku University Vienna, Austria


Presentation Overview:

Many text mining applications assess a document by its similarity to a list of predefined concepts, with the cosine similarity of a transformer-based vectorization emerging as a widely used distance metric. The computational cost can be distributed by approaches like Elasticsearch, which even includes a dedicated set of tools for optimizing such queries.
Still, the space required to embed each token with powerful transformer-based language models means that even a modest retrieval task like indexing all 30 million PubMed abstracts demands about 100 TB, limiting the speed of individual retrieval operations. There is thus great interest in effective approaches for reducing the space requirements of vector embeddings while retaining the characteristics that make them useful for meaningful search.
Here, we test established general approaches to dimensionality reduction and recent unsupervised compression algorithms developed specifically for codebook learning. More precisely, we compare their performance in supervised retrieval tasks. We find that the best methods can reduce storage requirements more than 8-fold while maintaining over 96% of the original accuracy, all the while executing the search directly on the compressed dense embedding vectors, yielding a notable improvement in retrieval efficiency.
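
As a rough illustration of searching directly on compressed vectors, the sketch below applies one general dimensionality reduction technique, an uncentered truncated SVD, to synthetic embeddings and compares cosine similarities before and after. The abstract does not say which methods were tested, so this choice is an assumption; the data are synthetic and deliberately low-rank, and all names (`cosine_matrix`, `docs`, `basis`) are hypothetical. It requires NumPy.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Synthetic "embeddings": 200 vectors of dimension 64 that lie close to
# an 8-dimensional subspace (an assumption that favours this method).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 8))
mixing = rng.normal(size=(8, 64))
docs = latent @ mixing + 0.01 * rng.normal(size=(200, 64))

# Uncentered truncated SVD: keep the top 8 right singular vectors.
# Skipping mean-centering keeps the origin fixed, which cosine depends on.
_, _, vt = np.linalg.svd(docs, full_matrices=False)
basis = vt[:8].T                      # shape (64, 8)

compressed = docs @ basis             # 8 numbers per vector instead of 64
queries = docs[:5]

full_sims = cosine_matrix(queries, docs)
small_sims = cosine_matrix(queries @ basis, compressed)
max_error = float(np.abs(full_sims - small_sims).max())
```

Here the similarity search runs entirely on the 8-dimensional vectors, and `max_error` measures how far the compressed similarities drift from the originals. Real transformer embeddings are not this close to low-rank, so the error on real data would be larger than in this toy setting.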

12:10-12:30
A deep-learning approach for contextualizing antimicrobial resistance genes
Room: MQRN
Format: Live from venue

Moderator(s): Zhiyong Lu

  • Arman Edalatmand, McMaster University, Canada
  • Xue Ji Zhao, McMaster University, Canada
  • Saduni Rajapaksa, McMaster University, Canada
  • Ramkrishna Upadhyaya, McMaster University, Canada
  • Abdalmuhaymen Ibrahim, McMaster University, Canada
  • Andrew G. McArthur, McMaster University, Canada


Presentation Overview:

Antimicrobial outbreak publications outline the key factors involved in an uncontrolled spread of infection. Such factors include the environments, pathogens, hosts, and antimicrobial resistance genes involved. Individually, each paper published in this area gives a glimpse into the devastating impact drug resistance has on healthcare, agriculture, and livestock. When examined together, these papers reveal a story across time, from the discovery of new resistance genes to their dissemination to different pathogens, hosts, and environments.

My work aims to extract this information from publications by using the biomedical deep-learning language model BioBERT. BioBERT is pre-trained on all abstracts found in PubMed and has state-of-the-art performance on language tasks using biomedical literature. I trained BioBERT on two tasks: entity recognition to identify AMR-relevant terms (e.g., AMR genes, taxonomy, environments, and geographical locations); and relation extraction to determine which terms identified through entity recognition contextualize AMR genes. Datasets were generated semi-automatically to train BioBERT for these tasks. My work currently collates results from 204,094 antimicrobial publications worldwide and generates interpretable results about the sources where genes are commonly found. Overall, my work takes a large-scale approach to collecting antimicrobial resistance data from a commonly overlooked resource.

14:30-15:20
Keynote Presentation: A Context-aware Artificial Intelligence Framework for Biomedical Natural Language Processing
Room: MQRN
Format: Live-stream

Moderator(s): Robert Leaman

  • Hongfang Liu


Presentation Overview:

In this talk, I will present past and current progress in clinical natural language processing. I will discuss the significant opportunities in translating the technology to support real-world applications. Specifically, to achieve real-world clinical NLP use, context representation and standardization are necessary, which requires formulating the task as a context-aware artificial intelligence framework.

15:20-15:30
Poster Lightning Talks
Room: MQRN
Format: Live from venue

Moderator(s): Robert Leaman

  • Poster authors
16:00-16:20
BioRED: A Comprehensive Biomedical Relation Extraction Dataset
Room: MQRN
Format: Live from venue

Moderator(s): Lars Juhl Jensen

  • Ling Luo, NCBI, NLM, NIH, United States
  • Po-Ting Lai, NCBI, NLM, NIH, United States
  • Chih-Hsuan Wei, NCBI, NLM, NIH, United States
  • Cecilia Arighi, University of Delaware, United States
  • Zhiyong Lu, NCBI, NLM, NIH, United States


Presentation Overview:

Automated relation extraction (RE) from biomedical literature is critical for text mining application development in both research and the real world. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g., protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we present BioRED, a first-of-its-kind biomedical RE corpus with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene-disease; chemical-chemical), on a set of 600 PubMed articles. Further, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including BERT-based models, on the named entity recognition (NER) and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a comprehensive dataset can successfully facilitate the development of more accurate, efficient, and robust RE systems for biomedicine.

16:20-16:40
Annotating and Indexing Scientific Articles with Rare Diseases
Room: MQRN
Format: Live-stream

Moderator(s): Lars Juhl Jensen

  • Hosein Azarbonyad, Elsevier, Netherlands
  • Zubair Afzal, Elsevier, Netherlands
  • Max Dumoulin, Elsevier, Netherlands
  • Rik Iping, Erasmus MC University Medical Center Rotterdam, Netherlands
  • George Tsatsaronis, Elsevier, Netherlands


Presentation Overview:

In Europe, 30 million people suffer from a rare disease. Rare disease patients are entitled to the best possible health care, which makes the efficient organization of the respective clinical care and scientific literature imperative. This requires deep bibliometric analysis that can be based on the efficient annotation and indexing of the respective scientific literature.

In this work, we present a novel methodology to annotate scientific articles with concepts that describe rare diseases from the OrphaNet taxonomy (orphadata.org). There are several technical challenges: first, some rare diseases are only rare in a specific part of the population; second, some of the rare diseases are very similar conceptually; third, the OrphaNet taxonomy might be incomplete in certain areas; and, fourth, polysemy and synonymy of the names of rare diseases may still hinder the applicability of any annotation engine. We will discuss how Elsevier has used TERMite, a state-of-the-art annotation engine, to query OrphaNet concepts on Scopus and, in combination with advanced NLP and text mining techniques, to address some of these challenges. We will demonstrate the results of such an analysis in rare diseases research, and highlight some directions for future research that may address the open challenges.

16:40-17:00
A general NLP approach to automatically interpret any gene list based on the literature
Room: MQRN
Format: Live-stream

Moderator(s): Lars Juhl Jensen

  • Matthew Artuso, Michigan State University, United States
  • Julia Santaniello, Lafayette College, United States
  • Arjun Krishnan, Michigan State University, United States


Presentation Overview:

Biomedical researchers frequently use association studies, omics technologies, and high-throughput screens to generate a list or a ranking of genes related to a process, function, trait, or disease. Likewise, computational researchers use data-driven algorithms to prioritize genes in various biomedical contexts. To subsequently make sense of the identified genes, researchers typically scour the literature to find shared themes. However, with hundreds to thousands of related papers containing knowledge described in unstructured, non-standardized text, such literature searches are inefficient and arduous. Here, we have developed a general natural language processing approach, MyGeneLit, that leverages titles and abstracts of 28 million papers on PubMed to automatically associate gene lists/rankings with ontology terms and biomedical phrases. MyGeneLit uses massive dictionaries of genes and biomedical terms to search for and score gene-term associations based on the relative number of papers co-mentioning each gene-term pair. After aggregation across all input genes and normalization for background signals, MyGeneLit returns a ranking of terms/phrases most pertinent to the user’s genes. We systematically evaluated MyGeneLit and demonstrated its accuracy and high generality by using it to interpret gene lists from genome-wide association studies (GWAS) and gene rankings from genotype-tissue expression (GTEx) studies.
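
The co-mention scoring described above can be sketched roughly as follows. This is not MyGeneLit's actual scoring: the abstract does not give the formula, so the per-gene fraction, the averaging across input genes, and the division by a term's background frequency are all assumptions, as are the function name `rank_terms` and the toy corpus.

```python
from collections import Counter, defaultdict

def rank_terms(papers, input_genes):
    """Rank terms by co-mention with the input genes, relative to background.

    Assumed scoring: for each gene, the fraction of its papers that mention
    the term; averaged over input genes; divided by the fraction of all
    papers that mention the term.
    """
    n_papers = len(papers)
    gene_count, term_count, pair_count = Counter(), Counter(), Counter()
    for paper in papers:
        for term in paper["terms"]:
            term_count[term] += 1
        for gene in paper["genes"]:
            gene_count[gene] += 1
            for term in paper["terms"]:
                pair_count[(gene, term)] += 1

    # Ignore input genes that never appear in the corpus.
    genes = [g for g in input_genes if gene_count[g] > 0]
    scores = defaultdict(float)
    for gene in genes:
        for term in term_count:
            scores[term] += pair_count[(gene, term)] / gene_count[gene] / len(genes)

    ranked = [(t, s / (term_count[t] / n_papers)) for t, s in scores.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

# Hypothetical toy corpus: each paper lists the genes and terms it mentions.
papers = [
    {"genes": {"G1"}, "terms": {"apoptosis", "cancer"}},
    {"genes": {"G1", "G2"}, "terms": {"apoptosis"}},
    {"genes": {"G2"}, "terms": {"apoptosis", "cell"}},
    {"genes": {"G3"}, "terms": {"cell"}},
    {"genes": {"G3"}, "terms": {"cell", "cancer"}},
    {"genes": {"G4"}, "terms": {"cell"}},
]
ranking = rank_terms(papers, ["G1", "G2"])
```

In this toy corpus, "apoptosis" ranks first for the input genes G1 and G2, while the very common term "cell" is pushed down by the background normalization even though it co-occurs with G2.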

17:00-18:00
Panel: Disease analysis with NLP
Room: MQRN
Format: Live from venue

Moderator(s): Lars Juhl Jensen

  • Katerina Nastou, Denmark
  • Hongfang Liu
  • Gary Bader, University of Toronto, Canada