ISCB Response to NIH Proposed Data Science Strategic Plan

To capitalize on the opportunities presented by advances in data science, the National Institutes of Health (NIH) is developing a Strategic Plan for Data Science. This plan describes NIH's overarching goals, strategic objectives, and implementation tactics for promoting the modernization of the NIH-funded biomedical data science ecosystem. As part of the planning process, NIH has published a draft of the strategic plan today, along with a Request for Information (RFI) to seek input from stakeholders, including members of the scientific community, academic institutions, the private sector, health professionals, professional societies, advocacy groups, patient communities, and other interested members of the public. ISCB's Public Affairs and Policy Committee submitted a response to the draft strategic plan on behalf of ISCB.

The Committee found that the plan provides a framework for dealing with data science challenges and recognizes the importance of data science to the overall success of the NIH mission. It is highly positive that the Plan continues the emphasis areas of the Big Data to Knowledge (BD2K) initiative and recognizes data standards, interoperability, infrastructure, and training as critical focus areas. In addition, the goals of defining different and appropriate funding mechanisms and review criteria for data-science efforts should resolve some major impediments within the current NIH funding ecosystem.

However, as outlined in the suggestions below, the Proposed Strategic Plan needs to provide a more concrete and detailed roadmap in its rather unique treatment of:

1) the distinction between databases and knowledge bases and their funding models;
2) the separation of software tool development from repositories and curation efforts; and
3) the challenges of data use globally, which receive only limited representation.

Considering that databases, information portals, knowledge bases, and tool developers represent a significant portion of a highly heterogeneous scientific data ecosystem, imposing binary funding distinctions may lead to dysfunctional or simply unworkable solutions.

ISCB's view is that the Proposed Strategic Plan is a much-needed step in the right direction, but that before long-term funding policy decisions are made, strong community input and buy-in are needed to expand on the specifics and on the possible unforeseen consequences of vague definitions and distinctions, as discussed in the suggested changes below.

Eliminate Separation of Databases from Knowledge Bases

We recognize that the distinction between "databases" and "knowledgebases" is stressed in the Strategic Plan as a way of differentiating funding mechanisms. However, the wide variety of information portals renders such a binary distinction highly artificial. The Strategic Plan states that a DB makes available the "core data" (no definition is offered) of some biological system, whereas a KB organizes information "related to core datasets," and states that KBs typically require significant curation whereas DBs do not. Model organism DBs (MODs) are cited as an example of DBs, yet MODs have undergone hundreds of person-years of curation effort, an apparent contradiction. MODs are knowledgebases. Perhaps the confusion is between "data stores," where data are not curated and metadata assignment is minimal, and "knowledgebases," where many relevant data stores are integrated and curated so as to facilitate the full use of the data for computational analysis.

The notion that "core data" is the key differentiating feature is quite vague and is not widely accepted among practitioners in the field. For example, the transcriptome is listed as core data (belonging to DBs), whereas an expression pattern is listed as belonging to KBs. But no clear separation into core versus related-to-core data is obvious in the following list of data types present in one well-known biological data/knowledge-base: genes, promoters, transcription factor binding sites, terminators, operons, metabolites, enzymatic reactions, transport reactions, metabolic pathways, and gene essentiality data. Since the Strategic Plan now seems to advocate that no efforts combining core and non-core data should be funded, presumably the preceding data/knowledge-base project (and most MODs as well) would have to be divided into two separate projects. Such an element of the Plan would create a major obstacle to the information integration that end users value so highly. In reality, a continuum exists between DBs and KBs, and attempts to find a reliable place on that continuum at which to define a funding policy would prove challenging and troublesome and may have to be abandoned.

Proposed Separation of Software Tool Development

We see the proposal to "Separate support for tools development from support for databases and knowledgebases" as quite problematic because, in practice, the development of many tools (software) and DBs/KBs is tightly intertwined, and their separation may be both impossible and ill-advised. Often, tools developed without the participation of the DB/KB community fail to actively incorporate data and data updates and fall into disuse.

For example, if a grant application is designed to develop the first DB for metabolomics data, it would be critical for this effort to develop software for parsing submitted data, validating submitted data, enabling curators to add and modify descriptions of the experimental conditions, storing submitted data in a database management system, and powering a user website that allows users to submit queries and view query results. Without such software, the DB could not be populated, checked for accuracy, or made available to users. Without the software, there is no database! It is not clear whether the Strategic Plan means to imply that every DB/KB project must involve two grant applications, one for the software and one for the DB/KB, but we do not consider this to be an advisable process.

Another problem with this idea is the apparent underlying assumption that any third party can easily write software tools for a given database. This is not the case, because every database has a schema, a precise computer definition of each type of data stored in the database (e.g., genes, proteins, metabolites), and the schema for a given database will change over time. Each software tool written for a given database must manipulate the data using exactly the same schema as the database currently uses; otherwise the database and the software will be incompatible.
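The schema coupling described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the two schema versions, the field names, and the helper `tool_fields_missing` are invented for illustration and are not taken from any real database or from the Strategic Plan. It simply shows how a tool written against one schema version can silently lose a field it depends on when the database team revises the schema.

```python
# Hypothetical schema versions: record type -> set of field names.
# These names are invented for illustration only.
SCHEMA_V1 = {"gene": {"id", "name", "product"}}
SCHEMA_V2 = {"gene": {"id", "symbol", "product_ids"}}  # "name" renamed to "symbol"

# Fields that an (assumed) third-party tool reads from each record type.
THIRD_PARTY_TOOL_REQUIRES = {"gene": {"id", "name"}}


def tool_fields_missing(schema, required):
    """Return, per record type, the required fields the schema does not provide."""
    missing = {}
    for record_type, fields in required.items():
        absent = fields - schema.get(record_type, set())
        if absent:
            missing[record_type] = sorted(absent)
    return missing


# The tool is compatible with the schema it was written for...
print(tool_fields_missing(SCHEMA_V1, THIRD_PARTY_TOOL_REQUIRES))
# ...but not with the revised schema, where "name" no longer exists.
print(tool_fields_missing(SCHEMA_V2, THIRD_PARTY_TOOL_REQUIRES))
```

A team that maintains both the database and its tools can update the two together when the schema changes; an independent tool developer has to discover and track each such change separately.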

We suspect that one motivation for the separation idea might be the notion that DB/KB grantees include software development tasks to try to bolster the innovative appearance of their applications, and that reviewers who appreciate the importance of the database might feel that the entire project needs to be scored highly, even when the proposed software is weak, to ensure the funding of an important database. Based on such assumptions, we suggest three alternative ways to view and solve this problem:

(1) The assumption seems to be that DB/KB applicants frequently include poor-quality tool development tasks in their proposals. It is likely that in most proposals the inclusion of high-quality tool development tasks leads to higher scores because the proposed tasks are excellent. Do data exist regarding the frequency of high- versus low-quality tool components in DB/KB applications?
(2) If NIH develops improved review criteria for DB/KB applications that decrease the weighting and/or necessity of innovation, both grantees and reviewers will need to put less emphasis on innovation, which should largely solve the problem.
(3) Particularly for large projects, reviewers should be encouraged to recommend excising project elements that they consider poor quality, to enable awarding of high scores to the remaining project elements. Imagine a DB application with an excellent operational plan but a weak plan for developing an innovative software tool. Everyone is served by continuing funding for the DB as a whole but not funding the software tool: users enjoy continued operation of the DB, and the project retains highly skilled staff members. The grantees should be given an opportunity to resubmit the software tool component for later funding consideration. However, excision of project components must be performed judiciously.
