NIH ODSS/ELIXIR
NIH-ELIXIR Track on the BioData Ecosystem
Schedule subject to change
All times listed are in CEST
10:30-10:50
Session: Core Resources at the Heart of Life Sciences
Invited Presentation: PRIDE & ProteomeXchange: Making proteomics data FAIR
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Peter Maccallum
- Juan Antonio Vizcaino, European Bioinformatics Institute (EMBL-EBI), United Kingdom
- The ProteomeXchange Consortium
Presentation Overview:
Mass spectrometry (MS) is by far the most used experimental approach in high-throughput proteomics. The ProteomeXchange consortium (https://www.proteomexchange.org) has standardized data submission and dissemination of public MS proteomics data worldwide and has recently been named a Global Core Biodata Resource. ProteomeXchange resources are committed to complying with the FAIR (Findable, Accessible, Interoperable, Re-usable) principles, support reproducible research and represent the state of the art in proteomics with regard to open data practices. The six members of the Consortium are PRIDE (UK), PeptideAtlas, MassIVE and Panorama Public (USA), jPOST (Japan) and iProX (China). Within ProteomeXchange, the PRIDE database at the European Bioinformatics Institute (an ELIXIR core data resource) is the most used resource, accounting for ~80% of all submitted datasets worldwide. Importantly, the activities of ProteomeXchange are aligned with the open data standards developed under the umbrella of the Proteomics Standards Initiative.
The perceived reliability of PRIDE and the rest of the ProteomeXchange resources has enabled an unprecedented increase in the amount of proteomics data in the public domain, which is now comparable to other omics fields such as transcriptomics. As a consequence, data re-use activities are flourishing and are revolutionizing the proteomics field. Some inspiring examples of how this data is being utilised by the scientific community will be showcased. Finally, some insights on the upcoming challenges will also be discussed, including the management of sensitive (clinical) human proteomics datasets.
10:50-11:10
Session: Core Resources at the Heart of Life Sciences
Invited Presentation: PhysioNet: A Quarter Century of Open Health Data
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Peter Maccallum
- Tom Pollard, Massachusetts Institute of Technology (MIT), United States
Presentation Overview:
PhysioNet is a data sharing platform that began as an outreach component for an NIH research project in 1999. Rebuilt in 2019 following FAIR principles (Findable, Accessible, Interoperable, Reusable), the platform has grown rapidly. It now serves over 75,000 registered users around the world with >30TB of data and is heavily used across research, education, and industry. PhysioNet is a recommended repository for journals including the Springer Nature collection, eLife, and PLOS. It also supports regular “datathons” internationally, which bring together clinicians and data scientists to focus on important, unanswered questions in health research. PhysioNet has been a close collaborator of MIT Libraries and it is piloting their data citation service, helping to establish datasets as primary research objects and to reward those who share.
While the vast majority of data on PhysioNet is fully open access, the platform supports training requirements and access control where necessary. This allows researchers to share sensitive resources that could not be shared through typical data sharing platforms. Around half of all PhysioNet users have been “credentialed”, providing evidence of their identity and training in human research. PhysioNet uses ORCID Trust Markers as part of this process. The software that underpins PhysioNet has been made completely open source and we are working to create a network of new, partner platforms. Repositories are being piloted by University of Mbarara in Uganda and University of Toronto, as part of the Temerty Centre for AI Research and Education in Medicine (T-CAIREM). Our goal is a network of interconnected repositories that share resources while maintaining local control and governance.
11:10-11:30
Session: Core Resources at the Heart of Life Sciences
Invited Presentation: From ArrayExpress to BioStudies
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Peter Maccallum
- Ugis Sarkans, EMBL-EBI, UK
Presentation Overview:
ArrayExpress was an archive of functional genomics data at EMBL-EBI, established in 2002. Initially it served as a database for publication-related microarray data, and was later extended to accept sequencing-based datasets. Over the last decade an increasing share of biological experiments has involved multiple technologies assaying different biological modalities. Also, new technologies generate data that do not yet have accepted community guidelines, standards, and databases. The BioStudies database (https://www.ebi.ac.uk/biostudies) was established to organize and publish multimodal data, as well as data for which specialized databases do not exist. Its central concept is a study, which typically is associated with a publication. BioStudies stores metadata describing the study, provides links to the relevant databases, such as the European Nucleotide Archive (ENA), and hosts types of data with no other home available. Since late 2022 all the existing ArrayExpress datasets have been archived and distributed from BioStudies, and new functional genomics data come into BioStudies through the Annotare data submission tool. We strove for a seamless transition, including for data access, where queries and data downloads are provided largely in the same manner as before. Support for typical life sciences data publishing workflows, in conjunction with a flexible metadata model, makes BioStudies a data sharing solution for emerging communities, as well as a sustainable platform for established data types.
11:30-11:40
Session: Core Resources at the Heart of Life Sciences
Invited Presentation: The European Nucleotide Archive
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Peter Maccallum
- Josephine Burgin, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Alisha Ahamed, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Carla Cummins, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Rajkumar Devraj, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Khadim Gueye, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Dipayan Gupta, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Vikas Gupta, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Muhammad Haseeb, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Maira Ihsan, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Eugene Ivanov, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Suran Jayathilaka, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Vishnukumar Balavenkataraman Kadhirvelu, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Manish Kumar, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Ankur Lathi, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Rasko Leinonen, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Milena Mansurova, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Jasmine McKinnon, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Lili Meszaros, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Colman O’Cathail, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Joana Paupério, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Stéphane Pesant, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Nadim Rahman, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Gabriele Rinck, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Sandeep Selvakumar, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Swati Suman, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Yanisa Sunthornyotin, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Senthilnathan Vijayaraja, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Zahra Waheed, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Peter Woollard, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- David Yuan, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Ahmad Zyoud, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Tony Burdett, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Guy Cochrane, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
Presentation Overview:
The European Nucleotide Archive (ENA) is a long-standing, freely-accessible archive for nucleotide sequences and related metadata, covering data types from raw reads to assembled sequences to genome assemblies as well as sample and experiment related metadata and functional annotation. It provides tools for data submission, search and retrieval and is widely used as a database of record for publication and data archival. The ENA archives an average of 24,129 submissions from 31 unique submitters every day and an average of 349,619 GB of data are downloaded from our retrieval services each month.
The ENA is recognised as an ELIXIR Core Data Resource and an ELIXIR recommended Deposition Database and sits within an ecosystem of integrated bioinformatics services within Europe and globally. The ENA is also a founding member of the International Nucleotide Sequence Database Collaboration (INSDC) with its partners in the National Center for Biotechnology Information (NCBI) in the United States and the DNA DataBank of Japan (DDBJ) where nucleotide data are globally shared. As a result, ENA data is global with over 92% of the world’s countries represented in the user-base.
We will present the ENA as it sits within the ELIXIR and INSDC landscape. We will also present our approaches to ensure the dataset is more sustainable as well as our intention to engage global players in the world of sequence archiving.
11:40-11:50
Session: Core Resources at the Heart of Life Sciences
Invited Presentation: Alliance of Genome Resources
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Peter Maccallum
- Paul Sternberg, Caltech, USA
- Carol Bult, The Jackson Laboratory, USA
- The Alliance of Genome Resources
11:50-12:00
Session: Core Resources at the Heart of Life Sciences
Invited Presentation: InterPro: Bringing together protein families resources for sustainability
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Peter Maccallum
- Alex Bateman, EMBL-EBI, UK
Presentation Overview:
InterPro is a central resource for protein domains and families. It aggregates data from 13 different databases in the field and integrates and further annotates them. InterPro is an ELIXIR Core Data Resource as well as a GBC Global Core Biodata Resource. Since its formation some of the member databases have ceased to operate (PRINTS and SFLD) and so InterPro provides sustainability for these resources in the longer term and has brokered their release with a CC0 license. Over recent years we have worked continuously to increase the efficiency of our pipelines, and we estimate that over the last five years the carbon emissions of our compute have fallen from 70 tCO2e in 2018 to a likely figure of just 10 tCO2e for 2023. These reductions have been achieved despite an overall increase in the number of sequences searched. Over the last year we also stopped running the Pfam website and merged the content and functionality into the InterPro website. We see that this approach could help lessen the burden of web development on other resources in the future. Finally, InterPro provides critical data for UniProt that enables UniProt to provide detailed automatic annotation for millions of TrEMBL entries and reduces duplication of computation.
12:00-12:10
Session: Core Resources at the Heart of Life Sciences
Invited Presentation: CATH: Protein Structure Classification Database
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Peter Maccallum
12:10-12:20
Session: Core Resources at the Heart of Life Sciences
Invited Presentation: BRENDA: 35 Years of Empowering Enzymology and Beyond
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Peter Maccallum
- Christian-Alexander Dudek, BRENDA (DSMZ), Germany
12:20-12:30
Session: Core Resources at the Heart of Life Sciences
Invited Presentation: SILVA - high quality ribosomal RNA datasets
Room: Salle Rhone 1
Format: Live-stream
Moderator(s): Peter Maccallum
- Jan Gerken, Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures GmbH, Germany
Presentation Overview:
SILVA (http://www.arb-silva.de) is a comprehensive resource for quality-controlled datasets of aligned ribosomal RNA (rRNA) gene sequences from all domains of life. It was established in 2007 and became an ELIXIR Core Data Resource (CDR) in 2018. This talk will look at the history of SILVA, how it became a CDR, what effect this has had on SILVA, as well as looking ahead to the future of SILVA as an integral part of the DSMZ Digital Diversity platform.
13:50-14:10
Session: The Federated/Distributed Landscape
Invited Presentation: A Landscape Analysis of Biodata Resources
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Ishwar Chandramouliswaran
Presentation Overview:
Biodata resources are wildly variable. Whether small or large, launched in the last year or in the last century, each is part of the distributed open data infrastructure that underpins life sciences research. Because these resources can be created by anyone anywhere at any time, this infrastructure has been difficult to characterize, let alone monitor. Not only does this increase barriers to wide-scale federation and interoperability between resources, fundamental questions about the landscape itself—such as which resources even exist and how they are supported—remain unanswered. This presentation will discuss why an overall characterization of the biodata resource landscape is important, who it’s important to, and current efforts to describe the landscape.
14:10-14:30
Session: The Federated/Distributed Landscape
Invited Presentation: [Federated] EGA: Providing global discovery and access for sensitive human data
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Ishwar Chandramouliswaran
- Aina Jené, Centre for Genomic Regulation, Spain
- Mallory Freeberg, EMBL European Bioinformatics Institute, United Kingdom
Presentation Overview:
The European Genome-phenome Archive (EGA) is a service for permanent archiving and sharing of personally identifiable genetic, phenotypic, and clinical data generated for the purposes of biomedical research projects or in the context of research-focused healthcare systems. Over the last 10 years, most individual-level human omics data have been generated in the context of research consortia and shared via global repositories such as the EGA. Many countries now have emerging personalized medicine programmes which are generating data from national or regional initiatives. Thus, human genomics is undergoing a step change from being a research-driven activity to one funded through healthcare initiatives. Genetic data generated in a healthcare context is subject to more stringent information governance than research data and often must comply with national legislation. To address this need, the Federated EGA provides a network of connected resources to enable transnational discovery of and access to human data for research while also respecting jurisdictional data protection regulations. By providing a solution to emerging challenges around secure and efficient management of human omics and associated data, the Federated EGA fosters data reuse, enables reproducibility, and accelerates biomedical research. In this talk, we will describe how the Federated EGA - in the context of European initiatives such as the European Genomic Data Infrastructure and European Health Data Space - aims to deliver a global resource for discovery and access of sensitive human omics and associated data consented for secondary use, through a network of national human data repositories to accelerate disease research and improve human health.
14:30-14:50
Session: The Federated/Distributed Landscape
Invited Presentation: The Coopetition model of collaboration in the NIH Generalist Repository Ecosystem Initiative
Room: Salle Rhone 1
Format: Live-stream
Moderator(s): Ishwar Chandramouliswaran
- Ana Van Gulick
- John Chodacki
Presentation Overview:
In February 2022, the NIH Office of Data Science Strategy launched the Generalist Repository Ecosystem Initiative (GREI), which brings together seven generalist repositories (Dataverse, Dryad, Figshare, Mendeley Data, Open Science Framework, Vivli, and Zenodo) to enhance support for NIH data sharing and discovery across generalist repositories. A key component of GREI is “coopetition”, a portmanteau of cooperation + competition, a term invoked to describe the collaboration among the generalist repositories to jointly advance repository functionality and bolster data sharing and reuse.
The repositories participating in GREI are at once both similar and varying. They all support FAIR data sharing across disciplines, strive to adhere to repository best practices, and leverage community standards such as DataCite metadata and persistent identifiers like ORCID. They also include both nonprofit organizations and for-profit companies, repositories built with open source and proprietary infrastructures, and varying features such as data visualization, curation, and controlled access.
The goal of GREI is to establish a common set of capabilities across repositories including common core metadata and metrics, support for key data sharing and search use cases, and training and outreach to position generalist repositories as part of the NIH data sharing landscape. The GREI coopetition model of collaboration fosters the development of a cohesive and interoperable generalist repository landscape where flexible data sharing is supported and data is discoverable across repositories. Simultaneously, coopetition also allows for specific repositories to offer varying features beyond this core functionality such as visualization and analysis, tool integrations, custom metadata, and advanced functionality for specific use cases.
As the NIH data repository landscape grows with the adoption of the new NIH Data Management and Sharing Policy, there are benefits and opportunities to global repositories working together in this way to meet the needs of research communities, funders, and institutions. In the future, discipline-specific data repositories and other research infrastructure providers may also wish to adopt some of the common GREI capabilities to reduce the barriers to data sharing and support greater interoperability across the repository landscape.
14:50-15:10
Session: The Federated/Distributed Landscape
Invited Presentation: CIViC: Accelerating the expert-crowdsourcing of cancer variant interpretation
Room: Salle Rhone 1
Format: Live-stream
Moderator(s): Ishwar Chandramouliswaran
- Obi Griffith, Washington University, United States
Presentation Overview:
Precision oncology involves the use of prevention and treatment strategies tailored to the unique features of each individual cancer patient and their disease. The number of molecular alterations or “variants” identified as cancer drivers or linked to cancer prognosis, diagnosis, or drug response has exploded. As a result, cancer care-givers are faced with a deluge of patient-specific variants that must be interpreted in the context of a vast and growing biomedical literature describing their significance. Currently, these variant interpretations exist largely in private or encumbered databases, resulting in extensive repetition of effort. Widespread adoption of precision medicine requires this knowledge to be centralized, standardized and expert-curated for application in the clinic. To address this need, we created CIViC, a community-driven web resource for Clinical Interpretation of Variants in Cancer, available online at civicdb.org. CIViC is distinguished from other resources by its fully open access, rich data model, and large community of volunteer expert curators. CIViC has been widely adopted by the community: it has many individual users, has been incorporated into numerous academic and commercial workflows, serves as the official curation platform for ClinGen Somatic variant curation, and is recognized as a Global Core Biodata Resource. Due to this widespread adoption, CIViC has seen a dramatic increase in the number of users integrating CIViC into their workflows and in submissions of new content, which require expert moderation and review. This presentation will discuss recent efforts to accelerate dissemination of high-quality knowledge, automate biocuration and moderation, and engage in outreach, education and collaborative activities to support editor recruiting, training and incentivization.
15:10-15:30
Session: The Federated/Distributed Landscape
Invited Presentation: NIAID Data Ecosystem Discovery Portal: creating a federated search engine to discover infectious and immune-mediated disease data
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Ishwar Chandramouliswaran
- Laura Hughes, Scripps Research, USA
- Meghan Hartwick, NIAID, USA
- Asiyah Lin, NIAID, USA
- Sudha Venkatachari, NCI, USA
- Maria Giovanni, NIAID, USA
- Mariam Namawejje, NIAID, USA
- Nichollette Acosta, Scripps Research, USA
- Candice Czech, Scripps Research
- Emily Haag, Scripps Research, USA
- Jason Lin, Scripps Research, USA
- Everaldo Rodolpho, Scripps Research, USA
- Ginger Tsueng, Scripps Research, USA
- Dylan Welzel, Scripps Research, USA
- Andrew Su, Scripps Research, USA
- Chunlei Wu, Scripps Research, USA
- Wilbert Van Panhuis, NIAID, USA
Presentation Overview:
The NIAID Data Ecosystem Discovery Portal is a searchable platform for Infectious and Immune-mediated Disease (IID) research that streamlines access to datasets from IID and generalist repositories – creating a “PubMed for IID Datasets.” IID researchers are enthusiastic about sharing and using data, but IID datasets are difficult to find because searching across diverse data types (e.g., omics, immunological, clinical, etc.) is challenging, and IID data are often stored in repositories which use different data models, metadata standards, and data access protocols. To address these challenges, we developed an IID-focused dataset schema which unites a growing number (>2.8 million) of metadata records in a standardized format from domain specific (4) and generalist (11) repositories. Researchers can use the Discovery Portal to find infectious and immune-mediated datasets, filter results based on user-specified criteria, and access the datasets on the source repository to use in downstream analyses. In the future, we will expand the Discovery Portal by incorporating more repositories, improving metadata quality, and optimizing searching. The Discovery Portal will enable researchers to more readily find data to better understand, treat, and ultimately prevent infectious and immunologic diseases.
16:00-16:20
Session: Knowledge & Impact from Data
Invited Presentation: UniProtKB – a hub for protein knowledge
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Fabio Liberante
- Alan Bridge, SIB Swiss Institute of Bioinformatics, Switzerland
Presentation Overview:
The UniProt Knowledgebase UniProtKB (at www.uniprot.org) is a comprehensive, freely accessible, and FAIR resource of protein sequences and functional annotation that covers all branches of the tree of life. UniProtKB is the centrepiece of the UniProt resource and a hub for protein knowledge. It provides a summary of experimentally verified and computationally predicted functional information for proteins using community standard ontologies – like the Gene Ontology (for protein functions) and ChEBI (for small molecules) – and knowledge resources like IMEx (for molecular interactions) and Rhea (for biochemical reactions). UniProt works closely with the developers of these resources – and many more – to enhance their scope and integration both with UniProtKB and with each other. UniProtKB integrates high value datasets – such as predicted structures from AlphaFoldDB, as well as proteomics and variation datasets from many sources – and provides an essential framework to interpret these datasets and derive knowledge from them. UniProtKB has been crucial to teaching AI methods the language of proteins, enabling them to predict protein structures and functions and even design new proteins from scratch, and now provides protein sequence embeddings to support the development of such methods.
In this talk, part of the session on “Knowledge & Impact from Data”, we will look at how UniProt is working with partners to improve knowledge representation in UniProtKB, as well as efforts to scale expert curation from the literature and other sources using AI methods such as language models. We will also discuss the essential role of expert biocurators and expert curated knowledgebases in the era of AI as trusted sources for an open and freely accessible ground truth.
16:20-16:40
Session: Knowledge & Impact from Data
Invited Presentation: National COVID Cohort Collaborative (N3C)
Room: Salle Rhone 1
Format: Live-stream
Moderator(s): Fabio Liberante
Presentation Overview:
The National COVID Cohort Collaborative (N3C) is a partnership across US academic medical centers (>232) to harmonize electronic health record data from >19M unique patients across OMOP, PCORnet, ACT, and TriNetX research networks. The N3C has implemented an open team science approach to navigate the societal, technical, regulatory, and clinical obstacles to sharing and analyzing sensitive data. Governance was established through shared decision making between the NIH and the community, resulting in ratified formal policies for data transfer, access, and use; a code of conduct; guiding principles; and a publication and attribution policy. The secure Enclave contains data that is the result of an advanced ingestion and harmonization pipeline containing >50,000 transforms with full provenance, supporting reuse and reproducibility. The N3C has pivoted clinical informatics from competitive to collaborative, as it incentivizes building upon each other's work to expedite science. Computational artifacts and concept set definitions are distributed via GitHub and within the Zenodo N3C community. N3C outcomes have been tremendous, with >200 manuscripts/preprints (2043 authors, h-index 23 and ~2K citations in 2 yrs); changed Covid patient care guidelines (in multiple countries); White House & State requests for data; NPR and NIH Director Blog; and the grand prize for data sharing within the NIH & FASEB Dataworks! Competition. Summary data at: https://covid.cd2h.org/dashboard/
16:40-17:00
Session: Knowledge & Impact from Data
Invited Presentation: The Network Data Exchange (NDEx)
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Fabio Liberante
- Jing Chen, University of California San Diego, USA
- Dylan Fong, University of California San Diego, USA
- Keiichiro Ono, University of California San Diego, USA
- Christopher Churas, University of California San Diego, USA
- Rudolf Pillich, University of California San Diego, USA
- Dexter Pratt, University of California San Diego, USA
Presentation Overview:
The Network Data Exchange (NDEx) is an online (https://www.ndexbio.org) commons for biological networks where users can upload, share, and distribute networks and where networks can be accessed by applications. The NDEx Integrated Query (NDEx IQuery) is a new tool for network and pathway-based gene set interpretation. It is available at iquery.ndexbio.org and linked to from MSigDB. A cancer-focused version is now integrated with cBioPortal. NDEx IQuery performs multiple gene set analyses based on diverse pathways and networks stored in NDEx. These include curated pathways from WikiPathways, SIGNOR, and the newly updated version of the popular NCI Pathway Interaction Database (NCI-PID v2.0). It also provides a novel
17:00-17:20
Session: Knowledge & Impact from Data
Invited Presentation: Connecting Molecules and Organisations - IMEx Molecular Interactions and Reactome Pathways
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Fabio Liberante
- Henning Hermjakob, European Bioinformatics Institute, European Molecular Biology Laboratory (EMBL-EBI)
- IMEx Consortium
- Reactome Consortium
Presentation Overview:
Biomolecular interactions are the fabric underlying almost all processes in living organisms, and they are determined by a broad array of experimental approaches, from focussed studies of pairwise interactions to large-scale determination of 10,000s of interactions in standardised high throughput experiments. However, observed molecular interactions are highly dependent on the biological and experimental conditions under which they are determined. Cellular systems, experimental protein tags, sequence modifications, and experimental approaches all heavily influence the observed interaction.
Since its inception in 2005, members of the International Molecular Exchange Consortium (IMEx) (1) have co-ordinated their approach to the systematic capture of molecular interaction data, increasing their collaboration level from initially defining common file formats via shared curation rules, to now sharing a single, web-based curation platform, and jointly responding to challenges like the Covid-19 pandemic.
The IMEx curation policy emphasises a fine-grained data and curation model, aiming to capture the relevant experimental detail essential for the interpretation of the provided molecular interaction data. Based on the detailed annotation of experimental methods, a confidence score is calculated for all interactions, supporting application-specific subsets of the IMEx interaction data. As an example of tight database collaboration across organisations and consortia, we demonstrate the integration of IMEx data in the Reactome database of manually curated human pathways, increasing Reactome coverage by ca. 40%.
17:20-17:30
Session: Knowledge & Impact from Data
Invited Presentation: Orphadata Science: a global core data resource for rare disease knowledge
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Fabio Liberante
17:30-17:40
Session: Knowledge & Impact from Data
Invited Presentation: The STRING Database: A Comprehensive Functional Annotation of Non-Model Organism Proteomes
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Fabio Liberante
Presentation Overview:
The STRING database is a highly-utilized resource of predicted and known protein-protein interactions (PPIs), encompassing both physical and functional associations. In addition to producing protein networks, STRING serves as an enrichment tool that integrates multiple protein function annotation ontologies, thus providing an in-depth understanding of the functional relationships within a gene set. The resource encompasses more than 12,000 fully sequenced genomes.
In its latest update, STRING has expanded its functionalities to allow any user to upload any organism to the database, requiring only the initial set of protein sequences (proteome). Through orthology predictions, STRING generates not only the network but also functional annotations of all the proteins in multiple widely-used ontologies. For these uploaded proteomes STRING offers a complete web-suite, including graphical UI, bulk-download, enrichment toolset and API access.
This newly implemented upload functionality is designed to minimize the disparity between pre-existing organisms in the STRING database and those newly added by users. The annotation pipeline is entirely accessible via the graphical web UI, with an emphasis on user-friendly interaction and requiring minimal user input. The uploaded proteomes and the resulting annotated datasets are persistent and sharable between all the users.
17:40-18:00
Session: Knowledge & Impact from Data
Invited Presentation: Europe PMC - connecting the literature to data
Room: Salle Rhone 1
Format: Live from venue
Moderator(s): Fabio Liberante
- Melissa Harrison, EMBL-EBI, United Kingdom
Presentation Overview:
Europe PMC is an open-access repository of full text life sciences research articles; it contains all of PubMed and PMC, collaborating with the National Library of Medicine USA to share full text content. It supports 37 funders in their open access policies by acting as a deposition database for accepted manuscripts, converting them to full text. Since 2018 it has indexed preprints, to date covering over 30 preprint servers and indexing over half a million preprints, 48 thousand of which are full text. Europe PMC enriches the literature with links to data, text-mined biological annotations, funding, ORCIDs, and more. Its core text and data mining pipeline mines accession numbers and data DOIs from 47 data repositories, resulting in over 8 million annotations that are available in the annotations API, searchable via the website and viewable via the Scilite web app. The annotations platform receives external contributions and contributes to annotations for: genes/proteins; chemicals; organisms; diseases; gene ontology; resources and experimental methods.
Europe PMC’s mission is to support innovation and discovery by engaging users, enabling contributors, and integrating related research outputs. Current and future developments will also be covered in this talk.
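As a hedged illustration of programmatic access to Europe PMC, the minimal sketch below builds a query URL for the Europe PMC RESTful search service and reads the hit count from a parsed response. The helper names (`build_search_url`, `hit_count`, `fetch`) are hypothetical and not part of any official client. The base URL, the `query`, `format` and `pageSize` parameters, the `SRC:PPR` preprint filter and the `hitCount` response field are assumed to match the public Europe PMC REST documentation and should be checked against it before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the Europe PMC RESTful search service (assumed to be current).
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query: str, page_size: int = 25) -> str:
    """Return a search URL requesting JSON results for the given query."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{BASE_URL}?{urlencode(params)}"


def hit_count(payload: dict) -> int:
    """Read the total number of matching records from a parsed response."""
    # "hitCount" is the field assumed to carry the total; default to 0 if absent.
    return int(payload.get("hitCount", 0))


def fetch(query: str) -> dict:
    """Perform the HTTP request (needs network access) and parse the JSON body."""
    with urlopen(build_search_url(query), timeout=30) as response:
        return json.load(response)
```

With network access, `hit_count(fetch('SRC:PPR AND "proteomics"'))` would report how many indexed preprints match the term. Keeping URL construction separate from the request makes the query logic easy to check offline.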