e-CCP1: The CCP1 e-Science Project
Philip Couch, Paul Sherwood, Peter Knowles, Huub van Dam, Robert Allan and Martyn Guest
Quantum chemists, who calculate the properties of molecular matter from first principles, have traditionally been at the forefront of exploiting computer hardware. However, the discipline has not seen the same level of innovation in adopting new informatics approaches as, for example, the bioinformatics community. Despite the ubiquity of ab initio programs such as Gaussian, MOLPRO and GAMESS, there are no agreed data standards for transferring much of the input data and results, much less intermediate data. Nor is there a way to archive the computational parameters in a form from which they can be readily (let alone automatically) regenerated for subsequent calculations. Most workers in the field manage input and output using only their computer's file system, with very few supporting tools. The availability of data exchange standards would open up the possibility for users to take advantage of particular features of different packages, rather than performing the whole calculation in a single package.
The issue of data standards is now being addressed within CCP1 in collaboration with the CCLRC e-Science Centre. In the long run, this has the potential to change the way software is designed and to maximise collaboration within the community, which is one of the primary aims of the CCPs. The DL-based CCP1 support staff played a major role in planning this project (now known as e-CCP1). The project RA, Philip Couch, began work in the e-Science Centre on 1 September 2003. The first full project meeting, involving e-Science Centre and CCP1 members (Profs P. Knowles and P. R. Taylor), was held in October 2003. Initial effort has focussed on protocols for data exchange based on XML (e.g. CML for molecules and associated data) as well as familiarisation with the CCP1 GUI developments.
It is intended that a broad international consultation will now be performed to consider the best way to reach consensus and define de facto standards for data exchange for objects such as geometries, basis sets and wavefunctions. A meeting was held in Edinburgh on 5-6 April 2004. Since then a number of data models and approaches to representation have been investigated. A component-oriented approach has been taken for the design of the data model, making it easily extensible and re-usable by different groups. The relationship between the data model and some of the documents and tools is illustrated below. Where possible, components are built upon the Chemical Markup Language and its extensions, such as CMLComp (http://cml.sourceforge.net). This enables compatibility with tools that already support this format. Since quantum chemistry is a computationally and data intensive discipline, there are a number of technical challenges. One such challenge is the handling of large datasets, which will need to be held in binary form; a number of possible technologies exist for this, including NetCDF, HDF, BinX and DFDL.
[Figure: The eCCP1 data handling scheme]
Concurrently with the development of the data model we are also looking into approaches for storing and handling the data. Once we have standard representations and tools it becomes possible to greatly increase data re-use and data sharing among the community, both within closely collaborating communities and globally (as the WWW has done for document sharing). As consensus is reached on parts of the data model, we will provide a reference implementation which allows existing software to read and write data in the new representation. In view of the current quantum chemistry software base, it is likely that Fortran and Python will be the first languages to be addressed. An important part of that reference implementation will be the visualisation and model-building tools also under development within CCP1, as discussed in Section 5.5.
The production of tools for a reference implementation requires consideration of the difficulties introduced by modifications and extensions to the current data model, and of the longer term need to understand data from other communities. Parsers can be implemented to work with a specific data model, but this approach incurs overheads, through code adaptation, whenever the data model changes. An improved approach involves writing a parser generator, which writes parser code from the data model. A further approach is to create a logical description of the data, together with a mapping from the logical to the physical description (XML schema). The eCCP1 project is adopting technologies from the semantic web community to provide these descriptions and mappings. In particular, the Web Ontology Language (OWL) is being considered for a description of the classes of objects and their properties (objects include molecules, atoms and atomic basis sets). RDF is then used to provide mappings between these classes and the physical description of the objects; these mappings make use of W3C standards (XPath, XSLT).
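The idea of separating a logical description from its physical XML location can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the eCCP1 parser: a plain dictionary (the hypothetical `MAPPING`) stands in for the RDF mapping documents, the function `extract` is invented for this example, and the sample document is a simplified CML-like fragment with the namespace omitted.

```python
import xml.etree.ElementTree as ET

# Simplified CML-like fragment (namespace omitted); illustrative data only.
SAMPLE = """
<molecule id="water">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.117"/>
    <atom id="a2" elementType="H" x3="0.000" y3="0.757" z3="-0.469"/>
    <atom id="a3" elementType="H" x3="0.000" y3="-0.757" z3="-0.469"/>
  </atomArray>
</molecule>
""".strip()

# Hypothetical stand-in for the RDF mappings: for each logical class, where
# its instances live in the document (a path expression) and how each logical
# property is read (XML attribute name plus a converter for the value).
MAPPING = {
    "Atom": {
        "path": "./atomArray/atom",
        "properties": {
            "label": ("id", str),
            "element": ("elementType", str),
            "x": ("x3", float),
            "y": ("y3", float),
            "z": ("z3", float),
        },
    },
}


def extract(root, class_name, mapping=MAPPING):
    """Return all objects of a logical class as dictionaries of properties.

    The function knows nothing about atoms or molecules; everything specific
    to the physical layout comes from the mapping.
    """
    rule = mapping[class_name]
    objects = []
    # ElementTree supports only a limited subset of XPath, which is
    # sufficient for this sketch.
    for node in root.findall(rule["path"]):
        obj = {
            name: convert(node.attrib[attr])
            for name, (attr, convert) in rule["properties"].items()
        }
        objects.append(obj)
    return objects


atoms = extract(ET.fromstring(SAMPLE), "Atom")
for atom in atoms:
    print(atom["label"], atom["element"], atom["x"], atom["y"], atom["z"])
```

If the physical representation changes (for example, coordinates move to a different attribute), only the mapping needs to be edited; the extraction code is untouched. That is the property the logical/physical separation is meant to provide.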
The parser uses the RDF documents to find out how to locate objects and their properties in XML documents, and how property values should be returned to the user. This approach allows a great deal of flexibility, and the parser is being written in such a manner that an in-depth knowledge of the technologies involved is not required. Consideration must be given not only to the representation of the quantum chemistry components, but also to the representation of the relationships between these components. An example is the specification of the mapping between atomic basis sets and the atoms that form a molecule. The eCCP1 project is investigating the use of W3C standards (XLink, XPointer) for linking components and providing information about the purpose of each link. Suggestions of different approaches, and a discussion of the relative merits of each, can be found on the project's TWiki site.
Getting Involved
If you are involved in the development of quantum chemistry software or are an interested user, you are encouraged to get involved in the discussions on data model development now. This will maximise the chance that the data models will serve the needs of a wide community, and thus encourage uptake and help us reach our target of interoperability. The eCCP TWiki is a web site which allows users to register and then contribute to the content. We have posted draft data models and the derived schemas and started a mail list (described there).