Designing XML and XML Schemas for Bioinformatics using UML

Philip Burton1, Russel Bruhn2
1pjburton@ualr.edu, University of Arkansas at Little Rock; 2rebruhn@ualr.edu, University of Arkansas at Little Rock

Extensible Markup Language (XML) will be used increasingly by the bioinformatics community to store and transmit structured data over the internet. An XML document is data marked up with tags, in a manner similar to HyperText Markup Language (HTML). Whereas HTML markup is for formatting or presenting data, XML markup is for organizing and structuring data. The user-defined choice of elements and attributes, the type of data contained within them, and the way elements nest within each other determine the structure of the document. This information is recorded as a set of rules in a schema. At present, most XML documents use a Document Type Definition (DTD) for this purpose. The DTD is an older form of schema inherited from the Standard Generalized Markup Language (SGML) and is now being replaced by a newer W3C standard, the XML Schema Definition (XSD) language. One way of designing an XML document and its associated schema is to use the Unified Modeling Language (UML) to display the data objects and their relationships graphically. In this paper, which is aimed at a general biological audience, we illustrate the process of creating an XML document from scratch. We take some data from the bioinformatics literature and apply the method of Routledge et al. (2002) to illustrate the data model using UML diagrams. The advantage of this approach is that it unifies the choice of elements and attributes (a recurring design question in XML) with the design of the schema. Furthermore, because the first step in the process is done at the conceptual level, domain experts such as biologists can participate in the choice of the elements and attributes. No technical knowledge of XML Schema is required.
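To make the notions of elements, attributes, and nesting concrete, the following is a minimal sketch in Python using the standard library module xml.etree.ElementTree. The document, its element names (gene, name, sequence), its attribute names (id, type), and the sequence string are all hypothetical and chosen only for illustration; they are not taken from the data set or the schema discussed in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: every tag, attribute, and value is made up.
DOC = """\
<gene id="g1">
  <name>exampleGene</name>
  <sequence type="DNA">ATGCGT</sequence>
</gene>
"""

# Parse the text into a tree of elements.
root = ET.fromstring(DOC)


def outline(element, depth=0):
    """Return one indented line per element, showing its tag and attributes."""
    lines = ["  " * depth + element.tag + " " + str(dict(element.attrib))]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(root)))
```

The outline mirrors the structure that a schema would constrain: which elements may appear, which attributes they carry, and how they nest. The sketch checks only that the document is well formed; validating it against a DTD or an XSD schema requires a validating parser, which the Python standard library does not provide.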