Environment from the Molecular Level
A NERC eScience testbed project
With grid computing it is inevitable that scientists will generate much more data than they are used to handling. This raises two issues: management of the physical components of the data (the electronic files), and management of the information contained within the data. Both in turn raise a third issue, namely that of handling metadata.
Physical and logical data management
The eMinerals project uses the Storage Resource Broker (SRB) at the heart of its data management strategy. The components of the SRB are a set of distributed data storage resources (called Vaults) and a central metadata catalogue (MCAT). Information about each file (including physical location) is stored in the MCAT, and the SRB client tools present the user with a single logical file system, in which the physical location of a file is reduced to the status of a file attribute.
We have data vaults in London, Bath, Reading and Cambridge, with a combined storage capacity of around 3 TBytes. It is important to note that the data storage requirement of the eMinerals project is primarily to manage many moderate-size files rather than a smaller number of very large files.
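The idea of a single logical file system backed by several vaults can be illustrated with a minimal sketch. This is not the SRB or MCAT API: the class names, vault names and paths below are all hypothetical, and a real catalogue would hold far more attributes per file.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Catalogue entry: the physical location is just another attribute."""
    vault: str          # which storage resource holds the bytes
    physical_path: str  # where the file lives inside that vault
    size_bytes: int


class ToyCatalogue:
    """Hypothetical stand-in for a central metadata catalogue (not the MCAT API)."""

    def __init__(self):
        self._records = {}

    def register(self, logical_path, record):
        """Record where the file behind a logical path is physically stored."""
        self._records[logical_path] = record

    def list_collection(self, prefix):
        """List logical paths under a collection, regardless of vault."""
        prefix = prefix.rstrip("/") + "/"
        return sorted(p for p in self._records if p.startswith(prefix))

    def locate(self, logical_path):
        """Resolve a logical path to its (vault, physical path) pair."""
        rec = self._records[logical_path]
        return rec.vault, rec.physical_path


# Two files in one logical collection, stored in different (made-up) vaults.
catalogue = ToyCatalogue()
catalogue.register("/home/alice/run1/output.xml",
                   FileRecord("cambridge", "/srb/vault/0001/a93f.dat", 48_213))
catalogue.register("/home/alice/run1/input.dat",
                   FileRecord("bath", "/data/srb/7c/22e1.dat", 1_024))

print(catalogue.list_collection("/home/alice/run1"))
print(catalogue.locate("/home/alice/run1/output.xml"))
```

The user works only with the logical paths; which vault holds a given file is looked up when needed rather than encoded in the path.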
The client tools include a set of Unix commands called Scommands, a Windows GUI called InQ, a web interface called MySRB, and our own web interface called TobysSRB. TobysSRB provides some important productivity improvements over MySRB for the eMinerals project team members.
Information management and information delivery
By enabling high-throughput combinatorial studies, grid computing generates so many files that it becomes increasingly difficult both to manage the information contained within the files and to share that information in a meaningful way.
One key component of managing information is the use of XML for data representation. We make use of the Chemical Markup Language (CML) in many of our simulation output files. We have developed a Fortran library, called FoX, to enable simulation code developers to incorporate the writing of XML/CML output files easily. The TobysSRB client tool can perform XML to XHTML transformations on the fly, incorporating SVG graphics to give graphs of time-series and step-wise data. In this way, collaborators can read data files and understand their information content without needing to understand the raw output formats of the code used to generate the data.
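As a rough illustration of turning step-wise data into an XHTML page with an embedded SVG graph, the sketch below builds the output tree directly with Python's standard library. It does not reproduce how TobysSRB performs its transformation; the function names, page layout and graph dimensions are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"
# Serialise XHTML as the default namespace and SVG with an "svg" prefix.
ET.register_namespace("", XHTML)
ET.register_namespace("svg", SVG)


def series_to_points(values, width=300, height=100):
    """Scale a non-empty series into the 'x,y x,y ...' string used by an SVG polyline."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0                      # avoid dividing by zero for flat data
    step = width / max(len(values) - 1, 1)
    return " ".join(
        # SVG y grows downwards, so flip the vertical axis.
        f"{i * step:.1f},{height - (v - lo) / span * height:.1f}"
        for i, v in enumerate(values)
    )


def render_page(title, values):
    """Return an XHTML document (as text) with a heading and a line graph."""
    html = ET.Element(f"{{{XHTML}}}html")
    body = ET.SubElement(html, f"{{{XHTML}}}body")
    ET.SubElement(body, f"{{{XHTML}}}h1").text = title
    svg = ET.SubElement(body, f"{{{SVG}}}svg", width="300", height="100")
    ET.SubElement(svg, f"{{{SVG}}}polyline", points=series_to_points(values),
                  fill="none", stroke="black")
    return ET.tostring(html, encoding="unicode")


# Hypothetical step-wise data, e.g. an energy recorded at each step.
page = render_page("Energy per step", [0.0, 5.0, 10.0])
print(page)
```

In practice the series would be read from the XML/CML output file rather than typed in, but the scaling and embedding steps would look much the same.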
XML is particularly useful to enable information to be retrieved by other computer programs. For example, a combinatorial study involving a sweep across a range of parameters may generate one file per parameter value, each containing one quantity of interest. Using our tools it is possible to sweep across all output files, extract the parameter value from each file, and collate the data into one file for plotting or subsequent analysis.
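The collation step described above can be sketched as follows. This is not the eMinerals tooling itself: it assumes a simplified layout in which each output document holds CML-style `parameter` and `property` elements identified by a `dictRef` attribute, each wrapping a `scalar`. The `demo:` dictRef values and the helper names are hypothetical, and the documents are held in memory rather than read from files so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}


def read_scalar(root, tag, dict_ref):
    """Return the float inside <tag dictRef=...><scalar>..</scalar></tag>, or None."""
    for elem in root.iterfind(f".//cml:{tag}", CML_NS):
        if elem.get("dictRef") == dict_ref:
            scalar = elem.find("cml:scalar", CML_NS)
            if scalar is not None and scalar.text:
                return float(scalar.text)
    return None


def collate(documents, parameter, quantity):
    """Extract one (parameter, quantity) pair per document, sorted by parameter."""
    rows = []
    for text in documents:
        root = ET.fromstring(text)
        x = read_scalar(root, "parameter", parameter)
        y = read_scalar(root, "property", quantity)
        if x is not None and y is not None:   # skip incomplete runs
            rows.append((x, y))
    return sorted(rows)


def make_doc(pressure, energy):
    """Build a tiny stand-in for one simulation output file (assumed layout)."""
    return (
        '<cml xmlns="http://www.xml-cml.org/schema">'
        f'<parameter dictRef="demo:pressure"><scalar>{pressure}</scalar></parameter>'
        f'<property dictRef="demo:energy"><scalar>{energy}</scalar></property>'
        '</cml>'
    )


docs = [make_doc(2.0, -10.5), make_doc(0.0, -12.0), make_doc(1.0, -11.4)]
table = collate(docs, "demo:pressure", "demo:energy")
for x, y in table:
    print(f"{x:.1f}\t{y:.1f}")   # two-column text, ready for plotting
```

Because the values are located by element and attribute rather than by line position, the same loop keeps working if the simulation code adds other output to its files.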
Recently the eMinerals project has developed a new metadata framework called the RCommands. This consists of a central metadata database, a set of Unix command-line tools, and a web interface. The RCommands enable us to create and manage metadata for both files and collections of files, and then to search for specific data files or collections using the metadata.
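The pattern of attaching metadata to files and collections and then searching on it can be sketched with a small in-memory database. This does not reproduce the RCommands schema or commands: the table layout, function names, paths and metadata values are all hypothetical.

```python
import sqlite3

# In-memory stand-in for a central metadata database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (
        id   INTEGER PRIMARY KEY,
        kind TEXT CHECK (kind IN ('file', 'collection')),
        path TEXT UNIQUE
    );
    CREATE TABLE metadata (
        item_id INTEGER REFERENCES item(id),
        name    TEXT,
        value   TEXT
    );
""")


def annotate(path, kind, **pairs):
    """Attach name/value metadata to a file or to a collection of files."""
    conn.execute("INSERT OR IGNORE INTO item (kind, path) VALUES (?, ?)",
                 (kind, path))
    (item_id,) = conn.execute("SELECT id FROM item WHERE path = ?",
                              (path,)).fetchone()
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)",
                     [(item_id, k, str(v)) for k, v in pairs.items()])


def search(name, value):
    """Return (kind, path) for every item carrying the given metadata pair."""
    rows = conn.execute(
        "SELECT item.kind, item.path FROM item "
        "JOIN metadata ON metadata.item_id = item.id "
        "WHERE metadata.name = ? AND metadata.value = ? "
        "ORDER BY item.path",
        (name, str(value)),
    )
    return rows.fetchall()


annotate("/study/quartz", "collection", mineral="quartz")
annotate("/study/quartz/run_300K.xml", "file", mineral="quartz", temperature=300)
annotate("/study/calcite/run_300K.xml", "file", mineral="calcite", temperature=300)

print(search("mineral", "quartz"))
print(search("temperature", 300))
```

Storing metadata as name/value pairs keeps the schema independent of any one simulation code, at the cost of treating every value as text.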
Papers that describe the eMinerals work on data management include the following, which can be downloaded as pdf files:
Collaborative grid infrastructure for molecular simulations: The eMinerals minigrid as a prototype integrated compute and data grid.
M Calleja, R Bruin, MG Tucker, MT Dove, RP Tyer, LJ Blanshard, K Kleese van Dam, RJ Allan, C Chapman, W Emmerich, PB Wilson, JP Brodholt, A Thandavan, VN Alexandrov.
Molecular Simulations 31, 303-313, 2005
eMinerals: Science Outcomes enabled by new Grid Tools.
M Alfredsson, E Artacho, M Blanchard, JP Brodholt, CRA Catlow, DJ Cooke, MT Dove, Z Du, NH de Leeuw, A Marmier, SC Parker, GD Price, JMA Pruneda, W Smith, I Todorov, K Trachenko, and K Wright.
Proceedings of All Hands 2005 (ISBN 1-904425-53-4), pp 788-795, 2005