|10:15 AM - 10:20 AM||Welcome TextMining|
|10:20 AM - 11:00 AM||Morning Keynote||Sophia Ananiadou, The University of Manchester, United Kingdom|
|11:00 AM - 11:20 AM||Tissue-aware framework for unraveling rare and complex diseases using biomedical literature||Ruth Dannenfelser, Princeton University, United States|
|11:20 AM - 11:40 AM||Discovery of disease- and drug-specific pathways through community structures of a literature network||Minh Pham, Baylor College of Medicine, United States|
|11:40 AM - 11:50 AM||OntoMate: a text-mining tool to facilitate curation at the Rat Genome Database||Jennifer R. Smith, Rat Genome Database, Medical College of Wisconsin, United States|
|11:50 AM - 12:00 PM||Cloud-based Phrase Mining Reveals Critical Molecular Insights of Major Cardiovascular Diseases||Peipei Ping, BD2K Center of Excellence @ UCLA, United States|
|12:00 PM - 12:20 PM||Learning Structured Knowledge from Clinical Case Reports||Wei Wang, University of California, Los Angeles, United States|
|12:20 PM - 12:40 PM||A text-mined integrated knowledge map for MicroRNAs||Debarati Roychowdhury, University of Delaware, United States|
|2:00 PM - 2:40 PM||Afternoon Keynote||Lawrence Hunter, UC Denver, United States|
|2:40 PM - 3:00 PM||Deep Neural Networks Ensemble for Detecting Medication Mentions in Tweets||Graciela Gonzalez-Hernandez, University of Pennsylvania, United States|
|3:00 PM - 3:20 PM||ClaimMiner: Query-guided Claim Mining in Biomedical Literature||Jiawei Han, BD2K Center of Excellence @ UIUC, United States|
|3:20 PM - 3:30 PM||FullMeSH: Improving Large-Scale MeSH Indexing with Full Text||Shanfeng Zhu, Fudan University, China|
|3:30 PM - 3:40 PM||An Effective Biomedical Document Classification Scheme in Support of Biocuration: Addressing Class Imbalance||Xiangying Jiang, University of Delaware, United States|
|3:40 PM - 3:50 PM||Using the power of text-mining for biological discovery with Europe PMC Annotations platform||Aravind Venkatesan, EMBL-EBI, United Kingdom|
|3:50 PM - 4:00 PM||A new approach and gold standard toward author disambiguation in MEDLINE||Raul Rodriguez-Esteban, F. Hoffmann-La Roche Ltd., Switzerland|
|4:40 PM - 5:00 PM||Poster lightning talks - TextMining|
|5:00 PM - 6:00 PM||Panel discussion: New challenges and opportunities in biomedical text mining and beyond||Graciela Gonzalez-Hernandez, University of Pennsylvania, United States|
Martin Krallinger, Barcelona Supercomputing Center, Spain
Hongfang Liu, Mayo Clinic, United States
Raul Rodriguez-Esteban, F. Hoffmann-La Roche Ltd., Switzerland
Tissue-specific alterations in gene function, expression, and interactions mediate disease pathophysiology. However, tissue specificity is not currently systematically curated in disease databases and is limited in breadth and scale. Furthermore, a host of ethical and technical challenges make it difficult to conduct large-scale experiments on human tissues in vivo. Here, we present PRETA (PREdictions with Tissue Associations), a fast, tissue-aware framework which learns tissue-disease-gene associations by systematic evidence integration from over 2.6 million biomedical abstracts, discovering 3.5 billion associations across 62 tissues, 3,252 diseases and phenotypes, and 18,299 genes. By integrating the information contained in literature, we unlock access to knowledge that is typically constrained to expert domains. We report four major contributions in addition to our tissue-aware framework: accurate disease-gene associations; a landscape of disease similarity for interpretation of rare and complex diseases; a literature-based tissue disease enrichment approach; and a website for browsing associations and calculating tissue disease enrichment. Our framework allows researchers to pinpoint genetic causes of tissue-specific dysfunction in complex diseases, interpret GWAS hits and differentially expressed gene sets, assess drug efficacy and suggest candidates for drug repurposing. Through knowledge transfer across diseases, PRETA provides insight into the mechanisms of rare and complex diseases and disorders.
In response to the exponential growth of scientific publications, text mining is increasingly used to extract biological pathways. Though multiple tools explore individual connections between genes, diseases, and drugs, not many extensively examine contextual biological pathways for specific drugs and diseases. We extracted 3,444 functional gene groups for specific diseases and drugs by applying a community detection algorithm to a literature network. The network aggregated co-occurrences of Medical Subject Headings (MeSH) terms for genes, diseases, and drugs in publications. The detected literature communities were groups of highly associated genes, diseases, and drugs. The communities significantly captured genetic knowledge of biological pathways and recovered future pathways in time-stamped experiments. Furthermore, the disease- and drug-specific communities recapitulated known pathways for those given diseases and drugs. In addition, diseases in the same communities had high comorbidity with each other and drugs in the same communities shared many side effects, suggesting that they shared mechanisms. Indeed, the communities robustly recovered mutual targets for drugs (AUROC = 0.75) and shared pathogenic genes for diseases (AUROC = 0.82). These data show that the literature communities not only represented known biological processes but also suggested novel disease- and drug-specific mechanisms, facilitating disease gene discovery and drug repurposing.
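The network construction described in this abstract can be illustrated with a minimal sketch. It counts how often pairs of terms are annotated on the same paper, then groups terms linked by sufficiently frequent co-occurrence. The abstract does not name the community detection algorithm, so the thresholded connected-components step below is only a crude stand-in for it, and the papers, terms, function names, and `min_weight` threshold are all made-up assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(papers):
    """Count how often each pair of terms is annotated on the same paper."""
    edges = Counter()
    for terms in papers:
        # Sort so that each unordered pair has a single canonical key.
        for a, b in combinations(sorted(set(terms)), 2):
            edges[(a, b)] += 1
    return edges


def connected_groups(edges, min_weight=2):
    """Group terms joined by edges of weight >= min_weight.

    This is NOT a real community detection algorithm, only a simple
    stand-in (connected components of the thresholded network).
    """
    neighbors = {}
    for (a, b), weight in edges.items():
        if weight >= min_weight:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Toy, invented annotations: each inner list is the term set of one paper.
papers = [
    ["TP53", "Neoplasms", "Doxorubicin"],
    ["TP53", "Neoplasms"],
    ["TP53", "Doxorubicin", "Neoplasms"],
    ["INS", "Diabetes Mellitus", "Metformin"],
    ["INS", "Diabetes Mellitus", "Metformin"],
    ["INS", "Neoplasms"],
]
edges = build_cooccurrence_network(papers)
groups = connected_groups(edges, min_weight=2)
print(groups)
```

On this toy input the single INS/Neoplasms co-occurrence falls below the threshold, so the gene, disease, and drug terms separate into one cancer-related group and one diabetes-related group.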
The Rat Genome Database (RGD, https://rgd.mcw.edu) is the premier online repository of rat genomic, genetic and physiologic data and is being further developed as a cross-species platform for translational research. Converting data from free text in the scientific literature to a structured format is one of the main tasks of all model organism databases. To aid curators in the task of identifying curatable papers and the relevant data therein, RGD has developed an ontology-driven custom text mining software tool called OntoMate that is tightly integrated with RGD's curation software. OntoMate's backend tools analyze the text and extract information from articles in order to enrich processed articles with semantic tags, including gene symbols, mutations, species, and terms from eleven ontologies used for curation. Named Entity Recognition for biocuration was implemented using plugins provided by bioNLP frameworks. Results are stored in a Hadoop NoSQL database which can be scaled horizontally. Tagged abstracts matching curator input are presented in a UI that provides user-activated filters and an integrated ontology browser to expedite the curation process. Proposed future enhancements include the ability to recognize tables containing quantitative phenotype data for RGD's PhenoMiner curation in full text articles and, if possible, supplementary materials.
We present two examples where the application of the Context-aware Semantic Online Analytical Processing (CaseOLAP) platform successfully extracted relevant information from text data and offered critical mechanistic insights in cardiovascular medicine. The rapid accumulation of biomedical textual data has far exceeded the human capacity of manual curation and analysis, necessitating novel text-mining tools to extract biological insights from large volumes of scientific reports. The CaseOLAP pipeline, developed in 2016, successfully quantifies user-defined phrase-category relationships through the analysis of textual data. We have developed a protocol for a cloud-based environment supporting the end-to-end phrase-mining and analysis platform. Our protocol includes data preprocessing (e.g., downloading, extraction, and parsing text documents), indexing and searching with Elasticsearch, creating a functional document structure “Text-Cube”, and quantifying phrase-category relationships using the CaseOLAP algorithm. Our data preprocessing generates key-value mappings for all documents involved. The preprocessed data is indexed to carry out a search of documents including entities, which further facilitates the Text-Cube creation and CaseOLAP score calculation. The raw CaseOLAP scores are interpreted using a series of integrative analyses, including dimensionality reduction, clustering, temporal, and geographical analyses. Additionally, the CaseOLAP scores are used to create a graphical database, which enables semantic mapping of the documents.
Unstructured data constitute a unique, rapidly expanding type of biomedical data and a treasure trove of undiscovered insights. Much of these data are within published manuscripts: PubMed currently indexes well over 29 million documents, including more than 2 million clinical case reports (CCRs). The giant volume of these text data, their variable structure, and their heterogeneous subdomains collectively present a herculean challenge for the biomedical research community in parsing them for discrete biomedical relationships. The abilities of human readers are amplified by tools to index and discern content within unstructured biomedical text, though scalable methods do not yet exist, requiring automated approaches to be built upon extensive manual annotation and curation. In this talk, we will present our latest research on Clinical Report Extraction and Annotation Technology (CREATe). This approach includes new natural language processing and machine learning models and algorithms capable of accurately recognizing entities corresponding to concepts and events in CCRs, determining their optimal types and relationships using distantly-supervised learning guided by existing ontologies and taxonomies, and minimizing human annotation. Our goals are to extract, organize, and learn from biomedical concepts within unstructured text, and translate them into a unified knowledge representation supporting efficient inference, integration, and interpretation.
Motivation: miRNAs are essential gene regulators and their dysregulation often leads to disease. Easy access to miRNA information is crucial for exploiting existing knowledge with the aims of designing new experiments, interpreting generated experimental data, connecting facts across publications and generating new hypotheses that build on previous knowledge. Here, we present an integrative text mining approach to collect miRNA information from the literature.
Results: We collected 100,000 miRNA-PubMed ID pairs from Medline. We used a set of existing publicly available and in-house developed tools to extract bioentities and their relations with miRNAs. The entity pairs include: miRNA-gene to detect miRNA-gene regulation (52,105 relations); miRNA-disease (51,569 relations) and differential expression level-miRNA to capture the differential expression of a miRNA in the context of disease vs. normal/disease stage (7,259 relations); miRNA-biological process (35,273 relations); circulatory miRNAs to capture potential biomarkers (6,599 relations); and miRNA-tissues and organs to provide biological context. Bioentities were normalized to facilitate querying and integration. We built a database and an interface to store and access the integrated data, respectively.
Conclusion: We will demonstrate that our resource can assist in answering relevant biological questions by evaluating miRNA-associated signatures in glioblastoma multiforme, the most aggressive form of primary brain tumor.
Twitter posts are now recognized as an important source of patient-generated data, providing unique insights into population health. A fundamental step to incorporating Twitter data in pharmacoepidemiological research is to automatically recognize medication mentions in tweets. Given that lexical searches for medication names may fail due to misspellings or ambiguity with common words, we propose a new method to recognize them.
We present Kusuri, a classifier able to identify tweets mentioning drug products and dietary supplements. First, Kusuri applies four different classifiers (lexicon-based, spelling-variant-based, pattern-based and one based on a weakly-trained neural network) in parallel to discover tweets potentially containing medication names. Then, Kusuri classifies the tweets discovered using an ensemble of deep neural networks.
On a balanced corpus of 15,005 tweets, Kusuri demonstrated performance close to that of human annotators, with a 93.7% F1-score, the best score achieved on this corpus. On a corpus made of all tweets posted by 113 Twitter users (98,959 tweets, of which only 0.26% mention medications), Kusuri obtained a 76.3% F1-score. There is no prior drug extraction system to compare against on such an unbalanced dataset.
The system identifies tweets mentioning drug names with performance high enough to ensure its usefulness once integrated in larger natural language processing systems.
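The two-stage design described in this abstract (parallel candidate filters, then an ensemble decision) can be sketched as follows. This is only an illustration under strong assumptions: the lexicon, spelling variants, regular expression, and stub models are invented, the weakly-trained neural filter is omitted, simple rule-based callables stand in for the deep neural networks, and the "any filter fires" and majority-vote rules are assumptions, since the abstract does not state how outputs are combined.

```python
import re

# Hypothetical toy resources; the real system's resources are not described here.
LEXICON = {"ibuprofen", "metformin", "melatonin"}
VARIANTS = {"ibuprofin": "ibuprofen", "metforman": "metformin"}
PATTERN = re.compile(r"\b(?:took|taking|prescribed)\s+(?:my\s+)?(\w+)", re.IGNORECASE)


def tokens(tweet):
    """Lowercase alphabetic tokens of a tweet."""
    return re.findall(r"[a-z]+", tweet.lower())


def lexicon_filter(tweet):
    return any(t in LEXICON for t in tokens(tweet))


def variant_filter(tweet):
    return any(t in VARIANTS for t in tokens(tweet))


def pattern_filter(tweet):
    return PATTERN.search(tweet) is not None


CANDIDATE_FILTERS = [lexicon_filter, variant_filter, pattern_filter]


def is_candidate(tweet):
    """Stage 1: keep a tweet if any filter fires (assumed union, favouring recall)."""
    return any(f(tweet) for f in CANDIDATE_FILTERS)


def ensemble_predict(tweet, models):
    """Stage 2: strict majority vote over binary classifiers (assumed rule)."""
    votes = sum(1 for model in models if model(tweet))
    return votes * 2 > len(models)


def classify(tweet, models):
    """A tweet is labelled positive only if it passes both stages."""
    return is_candidate(tweet) and ensemble_predict(tweet, models)


# Stub callables standing in for trained networks (illustration only).
STUB_MODELS = [
    lambda t: lexicon_filter(t) or variant_filter(t),
    pattern_filter,
    lambda t: any(w in tokens(t) for w in ("headache", "pain", "dose")),
]

for tweet in [
    "Took ibuprofin for my headache",
    "I took the bus to work",
    "metformin dose changed again",
    "Lovely weather today",
]:
    print(tweet, "->", classify(tweet, STUB_MODELS))
```

The second toy tweet shows why a second stage is useful: the permissive pattern filter lets "took the bus" through as a candidate, and the ensemble vote then rejects it.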
Motivation: Claim mining is a text mining task that automatically extracts literature evidence to support scientific hypothesis validation. Previous work on claim mining assumes that a small set of human-annotated articles containing claims is given as training examples. However, it is non-trivial to select a set of human-annotated articles, and the annotation is prone to errors.
Results: We propose ClaimMiner, the first query-guided claim mining method for biomedical literature without human-annotated training examples. Given a query, ClaimMiner incorporates the information from the words and entities in the query and textual patterns automatically extracted from massive corpora, extracts the related sentences, and ranks them as claims. Moreover, ClaimMiner allows queries containing general entity types, which has never been explored by previous claim mining methods. ClaimMiner is evaluated on a subset corpus of PubMed and shows great performance on claim ranking.
With the rapidly growing biomedical literature, automatically indexing biomedical articles by Medical Subject Heading (MeSH), namely MeSH indexing, has become increasingly important for facilitating hypothesis generation and knowledge discovery. Over the past years, many large-scale MeSH indexing approaches have been proposed, such as Medical Text Indexer (MTI), MeSHLabeler and DeepMeSH. However, the performance of these methods is hampered by using limited information, i.e. only the title and abstract, of biomedical articles.
We propose FullMeSH, which takes advantage of the recent increase in the availability of full text articles. Compared to DeepMeSH, one of the state-of-the-art methods, FullMeSH has three novelties: 1) Instead of using a full text directly, FullMeSH segments it into several sections in order to distinguish their contributions to the overall performance; 2) FullMeSH integrates evidence from different sections in a “learning to rank” framework by combining sparse and deep semantic representation; and 3) FullMeSH trains an Attention-based Convolutional Neural Network (AttentionCNN) for each section, which achieves a better performance on infrequent MeSH headings. FullMeSH achieved a Micro F-measure of 66.76% on a test set of 10,000 articles, which was 3.3% and 6.4% higher than DeepMeSH and MeSHLabeler, respectively.
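For readers unfamiliar with the Micro F-measure reported above, the sketch below shows how the standard micro-averaged metric is computed for multi-label indexing: true positives, false positives, and false negatives are pooled over all articles before precision and recall are taken. The gold and predicted heading sets are invented toy data, and the function name is an assumption; the abstract does not describe the evaluation code.

```python
def micro_f_measure(gold, predicted):
    """Micro-averaged precision, recall and F-measure for multi-label output.

    gold and predicted are parallel sequences of label sets, one per article.
    """
    tp = fp = fn = 0
    for gold_labels, predicted_labels in zip(gold, predicted):
        g, p = set(gold_labels), set(predicted_labels)
        tp += len(g & p)  # headings correctly assigned
        fp += len(p - g)  # headings assigned but not in the gold set
        fn += len(g - p)  # gold headings that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy, invented example with two articles.
gold = [{"Humans", "Neoplasms"}, {"Mice", "Insulin", "Diabetes Mellitus"}]
predicted = [{"Humans", "Neoplasms", "Mice"}, {"Mice", "Insulin"}]

precision, recall, f_measure = micro_f_measure(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

Because counts are pooled across articles, frequent headings dominate the micro-averaged score, which is one reason performance on infrequent headings is usually discussed separately.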
The published literature is an important source of information supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. Notably, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential in the context of biomedical document classification.
We present here an effective classification scheme for identifying publications containing relevant information for an actual Mouse Genome Informatics curation task involving a large imbalanced dataset. The scheme is based on meta-classification, employing cluster-based under-sampling combined with named-entity recognition (NER) and statistical feature selection strategies.
We examined the performance of our method over a large imbalanced dataset, originally generated and curated by the Jackson Laboratory’s Gene Expression Database (GXD), consisting of more than 90,000 PubMed abstracts. Our results, 0.80 recall, 0.72 precision and 0.75 f-measure, demonstrate that our classification scheme effectively categorizes such a large dataset in the face of data imbalance.
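As a rough illustration of cluster-based under-sampling, the sketch below draws majority-class documents evenly from precomputed clusters so that the reduced set still covers the variety of the majority class. This is one common variant only: the abstract does not say how the clusters are built or how they feed the meta-classifier, so the cluster labels, document identifiers, function name, and even-split rule here are all assumptions.

```python
import random


def cluster_under_sample(majority, clusters, n_keep, seed=0):
    """Keep roughly n_keep majority items, spread evenly across clusters.

    majority: list of majority-class items (e.g. irrelevant documents).
    clusters: parallel list of cluster labels, assumed to be precomputed
              by any clustering method.
    """
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    by_cluster = {}
    for item, label in zip(majority, clusters):
        by_cluster.setdefault(label, []).append(item)
    per_cluster = max(1, n_keep // len(by_cluster))
    kept = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        kept.extend(rng.sample(members, min(per_cluster, len(members))))
    return kept


# Toy, invented data: 12 irrelevant documents in 3 clusters, 3 relevant ones.
majority = ["neg%d" % i for i in range(12)]
clusters = [0] * 6 + [1] * 4 + [2] * 2
minority = ["pos0", "pos1", "pos2"]

kept = cluster_under_sample(majority, clusters, n_keep=len(minority))
balanced = minority + kept
print(balanced)
```

Sampling per cluster rather than uniformly at random reduces the chance that a small but distinctive group of majority-class documents disappears from the training set.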
In the life sciences, most experimental results and analyses are still shared via research publications, which, given over 1M papers per year and their natural-language text, makes finding and assimilating scientific facts challenging. To this end, text mining facilitates the extraction of scientific assertions with their biological context. However, it requires expertise in algorithms and data management, which can be a barrier. Therefore, we have established a community platform for text-mined annotations as a part of Europe PMC, in which text mining groups contribute results that are aggregated (~500 million annotations) and made available via APIs and an application called SciLite, which displays the annotations on articles. This approach allows curators and bioinformaticians, as well as other stakeholders, to more readily integrate text-mined outputs in their data analysis without having to do the text mining themselves. For the ISMB/ECCB 2019 conference, we will present the Europe PMC Annotations platform and discuss the approach and the challenges in more detail.