
Environment from the Molecular Level

A NERC eScience testbed project

XML and the Chemical Markup Language

Very brief basics

XML (eXtensible Markup Language) is a framework for data representation, on which domain-specific languages can be built. The key feature of XML is that any piece of data will be described by surrounding open and close tags, such as

<example>
<exampleTitle>
"This is the title of an example"
</exampleTitle>
<exampleLabel Label="first" />
</example>

Provided that one understands the meaning of an example and the exampleTitle and exampleLabel fields, the information content is clear and unambiguous.

This example illustrates two simple ways of marking up data. The exampleTitle data object is enclosed between start and end tags (<exampleTitle> and </exampleTitle> respectively). The exampleLabel data object, on the other hand, carries one attribute, namely Label, and holds all of its data within a single <.../> tag. This is a shorthand that could be expanded into a pair of start and end tags.
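The equivalence of the shorthand and expanded forms can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are chosen here for illustration and do not come from the eMinerals tools.

```python
import xml.etree.ElementTree as ET

# The shorthand form from the example, and its expanded equivalent.
short_form = '<exampleLabel Label="first" />'
long_form = '<exampleLabel Label="first"></exampleLabel>'

short_elem = ET.fromstring(short_form)
long_elem = ET.fromstring(long_form)

# Both forms yield the same tag name, the same attributes and no children.
same_tag = short_elem.tag == long_elem.tag
same_attributes = short_elem.attrib == long_elem.attrib
both_empty = len(short_elem) == 0 and len(long_elem) == 0

print(same_tag, same_attributes, both_empty)
print(short_elem.get("Label"))
```

A program reading the data therefore does not need to know which of the two forms the writer chose.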

The challenge to the XML user is to devise a scheme that enables his/her data to be represented using this approach.

Chemical Markup Language, CML

The eMinerals project makes considerable use of CML, developed by Peter Murray-Rust and Henry Rzepa. CML was one of the first XML languages, and it has been both refined and expanded in recent years.

A simple example extract from a CML file is:

<?xml version="1.0" encoding="UTF-8"?>
<cml xmlns="http://www.xml-cml.org/schema"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">

<metadataList>
<metadata name="identifier" content="DL_POLY version 3.06 / March 2006"/>
</metadataList>

<parameterList title="control parameters">
<parameter title="simulation temperature" name="simulation temperature"
dictRef="dl_poly:temperature">
<scalar dataType="xsd:double" units="dl_polyUnits:K"> 50.0 </scalar>
</parameter>
</parameterList>

<propertyList title="rolling averages">
<property title="total energy" dictRef="dl_poly:eng_tot">
<scalar dataType="xsd:double" units="dl_polyUnits:eV_mol.-1"> -2.7360E+04
</scalar>
</property>
</propertyList>

</cml>

The basic usage is to put data into various List tags, with definitions of terms held in dictionaries. Specifically, we collect metadata, simulation parameter and computed property data:

  1. The first line tells any program reading the file that the data are represented as XML. The second line opens the CML content, with the namespaces of the schemas given as attributes.
  2. The metadataList within the second block contains some very simple metadata associated with a specific job. The content field contains information about the simulation code (name, version, date).
  3. The third block of the file contains the parameterList data, namely copies of some of the input parameters that controlled the simulation. The meaning of the parameter, although obvious in this example, is actually defined in an external dictionary. Note that CML requires that numbers are matched with units.
  4. The fourth block of the file contains some data derived from the simulation, here called properties and enclosed within a propertyList tag. The same comments given for the parameterList data apply equally in this case.
  5. The final part of the file is the closure of the <cml> tag, and is essential.
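To show how a program might read such a file, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It embeds a compact copy of the example above; the helper read_scalars and the variable names are illustrative assumptions, not part of any eMinerals tool, and the sketch assumes every parameter or property holds exactly one scalar.

```python
import xml.etree.ElementTree as ET

# A compact copy of the CML example shown above.
CML_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<cml xmlns="http://www.xml-cml.org/schema"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <metadataList>
    <metadata name="identifier" content="DL_POLY version 3.06 / March 2006"/>
  </metadataList>
  <parameterList title="control parameters">
    <parameter title="simulation temperature" name="simulation temperature"
               dictRef="dl_poly:temperature">
      <scalar dataType="xsd:double" units="dl_polyUnits:K"> 50.0 </scalar>
    </parameter>
  </parameterList>
  <propertyList title="rolling averages">
    <property title="total energy" dictRef="dl_poly:eng_tot">
      <scalar dataType="xsd:double" units="dl_polyUnits:eV_mol.-1"> -2.7360E+04
      </scalar>
    </property>
  </propertyList>
</cml>
"""

# Every element lives in the CML namespace declared on the <cml> tag.
NS = {"cml": "http://www.xml-cml.org/schema"}
root = ET.fromstring(CML_TEXT.encode("utf-8"))


def read_scalars(list_tag, item_tag):
    """Return {dictRef: (value, units)} for each item in the named list."""
    results = {}
    for item in root.findall(f"cml:{list_tag}/cml:{item_tag}", NS):
        scalar = item.find("cml:scalar", NS)
        # float() ignores the whitespace surrounding the number.
        results[item.get("dictRef")] = (float(scalar.text), scalar.get("units"))
    return results


metadata = {
    m.get("name"): m.get("content")
    for m in root.findall("cml:metadataList/cml:metadata", NS)
}
parameters = read_scalars("parameterList", "parameter")
properties = read_scalars("propertyList", "property")

print(metadata)
print(parameters)
print(properties)
```

Each value is returned together with its units string, so the pairing of numbers with units that CML requires is preserved on reading.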

In practice XML/CML files will be much more extensive than this simple example (some examples are given here and here), but they will follow the same basic form. It needs to be stressed that XML files must adhere strictly to a set of rules; not doing so will inevitably lead to ambiguities, and hence to difficulties that are avoided through enforcement of the rules.
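As an illustration of how strictly the rules are enforced, the sketch below uses Python's standard xml.etree.ElementTree parser, which refuses outright to read a file with a missing closing tag. The function name is_well_formed is an assumption made for this example. Note that it tests only basic XML well-formedness, not conformance to the CML schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


good = "<propertyList><property title='total energy'/></propertyList>"
# The same content, but the closing </propertyList> tag is missing.
bad = "<propertyList><property title='total energy'/>"

print(is_well_formed(good))
print(is_well_formed(bad))
```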

The advantages of XML

The main advantage of XML is that data can be presented in a self-defining and unambiguous manner. The benefit of this is that data can be read by programs without the programmer needing to be aware of strict data formats. Data interoperability between codes is one of the regular nightmares (or bad dreams at best) faced by the simulation scientist. A consistent use of XML removes many of the problems with data formats. Although XML files need to follow a set of rules, in practice, provided the file is good XML, the actual organisation of the data need not matter. All information about definitions may be held in separate schema and dictionary files.
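The point that the organisation of the data need not matter can be sketched as follows. The function find_by_dictref and the wrapper tags module and step are hypothetical, introduced only for this illustration, and namespaces are omitted for brevity. The same lookup, keyed on the dictionary reference, works for two differently organised files.

```python
import xml.etree.ElementTree as ET


def find_by_dictref(xml_text, dict_ref):
    """Return the scalar value of the first element with the given dictRef."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the depth of the element is irrelevant.
    for elem in root.iter():
        if elem.get("dictRef") == dict_ref:
            return float(elem.find("scalar").text)
    return None


# Two layouts holding the same parameter at different depths.
layout_a = (
    '<cml><parameterList><parameter dictRef="dl_poly:temperature">'
    '<scalar units="dl_polyUnits:K">50.0</scalar>'
    "</parameter></parameterList></cml>"
)
layout_b = (
    '<cml><module><step><parameter dictRef="dl_poly:temperature">'
    '<scalar units="dl_polyUnits:K">50.0</scalar>'
    "</parameter></step></module></cml>"
)

print(find_by_dictref(layout_a, "dl_poly:temperature"))
print(find_by_dictref(layout_b, "dl_poly:temperature"))
```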

Example

One of the applications of CML within the eMinerals project is to enable representation of data in a way that can easily be understood by collaborators.

We have a tool, ccViz, that will take a CML output file and transform the information into an XHTML file that can be viewed with a standard web browser. Data in the form of tables of numbers are transformed to SVG (Scalable Vector Graphics, another XML language) for instant plotting. Examples are given here and here.
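The idea of turning a table of numbers into SVG can be sketched in a few lines. This is not how ccViz itself is implemented; it is a minimal illustration using Python's standard library, and the function table_to_svg and the sample points are made up for the example.

```python
import xml.etree.ElementTree as ET


def table_to_svg(points, width=200, height=100):
    """Return an SVG document that plots (x, y) pairs as a single polyline."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    # Guard against division by zero when all values are identical.
    x_span = (max(xs) - x_min) or 1.0
    y_span = (max(ys) - y_min) or 1.0

    coords = []
    for x, y in points:
        px = (x - x_min) / x_span * width
        # The SVG y axis points downwards, so the data are flipped.
        py = height - (y - y_min) / y_span * height
        coords.append(f"{px:.1f},{py:.1f}")

    svg = ET.Element(
        "svg",
        {
            "xmlns": "http://www.w3.org/2000/svg",
            "width": str(width),
            "height": str(height),
        },
    )
    ET.SubElement(
        svg,
        "polyline",
        {"points": " ".join(coords), "fill": "none", "stroke": "black"},
    )
    return ET.tostring(svg, encoding="unicode")


# Made-up sample data, for illustration only.
sample_points = [(0, 0.0), (1, 1.0), (2, 0.5)]
svg_text = table_to_svg(sample_points)
print(svg_text)
```

Because the result is itself XML, it can be embedded directly in an XHTML page and displayed by a browser.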

Many of our simulation codes now generate a CML output file. Since most of our codes are written in Fortran (typically F90 or F95), and since Fortran does not have any native XML tools, we have created a library of XML-aware tools for Fortran called FoX.

General references

Some papers that describe work that the eMinerals project has carried out on CML can be obtained as PDF downloads from the following links:

"Application and Uses of CML within the eMinerals project". TOH White, P Murray-Rust, PA Couch, RP Tyer, RP Bruin, IT Todorov, DJ Wilson, MT Dove, KF Austen.

"Towards Data Integration for Computational Chemistry." PA Couch, P Sherwood, S Sufi, IT Todorov, RJ Allan, PJ Knowles, RP Bruin, MT Dove and P Murray-Rust. Proceedings of All Hands 2005 (ISBN 1-904425-53-4), pp 426–432, 2005

"The use of XML and CML in Computational Chemistry and Physics Programs". A. García, P. Murray-Rust, J. Wakelin. Proceedings of the UK e-Science All Hands Meeting 2004, (ISBN 1-904425-21-6), pp 1111-1114, 2004