Introduction to the Core Scientific Metadata Model (CSMD)

Brian Matthews, 2012

An Information model for facilities data management

Metadata is a key factor in the archiving and distribution of scientific data. Through the use of good metadata models, defined at the appropriate level, scientists can publish and share data, and allow the results of experiments and studies to be browsed, searched and cited. Appropriate metadata can thus encourage the reuse of data within and across scientific disciplines. We propose that metadata should be designed to fit within an information model for data collection within the facilities process. Current work on information modelling within the PaN-data consortium is focussed on the Core Scientific Metadata Model (CSMD) [Matthews et al. 2010], which has been designed within the context of the ICAT data catalogue tool [Flannery et al. 2009], although it is also implemented within other systems such as the Tardis system [Androulakis et al. 2008] used within Australian facilities.

CSMD as an Information model

The Core Scientific Metadata Model (CSMD) is a study-data orientated model which has been developed at STFC over the last 10 years; for earlier work see [Sufi and Matthews 2004, 2005]. The CSMD is being used as the core metadata model within the data management infrastructure which is being developed for the large scale scientific facilities supported by STFC, including the ISIS Neutron Source and the Diamond Light Source. It has been designed to capture information about experiments and the data they produce within facilities science, so that users can search for data of interest. It is the result of an analysis of science practice over a number of years and a number of projects. The currently implemented version is based on the CCLRC Scientific Metadata Model v2 [Sufi and Matthews 2004] with further modifications.

In particular, it is a key aspect of the ICAT, a software suite designed to manage the cataloguing of and access to facilities data [Flannery et al. 2009]. However, the model is intended to capture high level information about scientific studies and the data that they produce. It is thus designed to be generic across scientific disciplines and has application beyond facilities science, particularly in the "structural sciences" (such as chemistry, material science, earth science, and biochemistry), which are concerned with the molecular structure of substances, and within which systematic experimental analyses are undertaken on material samples.

The model is organised around a notion of Studies, a study being a body of scientific work on a particular subject of investigation. During a study, a scientist would perform a number of investigations, e.g. experiments, observations, measurements and simulations. Results from these investigations usually run through different stages: the collection of raw data, the generation of analysed or derived data through the application of software tools, and end results. Data should be grouped accordingly, and associated with the appropriate experimental parameters. Not all information captured in specific metadata schemas would be used to search for this data or to distinguish one data set from another, so it should be possible to select particular parameters for this purpose. The CSMD is designed to be a common general format/standard for Scientific Studies and their associated data holdings.

Thus this model:

The CSMD has been developed to be a core system which is extensible and can be specialised to particular scientific domains, so it does not make assumptions about the specific terminology of the domain.

The Core Information Model

CSMD is organised around a notion of Studies, a study being a body of scientific work on a particular subject of investigation. During a study, a scientist would perform a number of investigations, e.g. experiments, observations, measurements and simulations. Results from these investigations usually proceed through different stages: raw data is generated, this is then analysed to produce derived data, which may then be refined to an end result suitable for publication.

The model thus defines a hierarchical model of the structure of scientific research around studies and investigations, with their associated information, and also a generic model of the organisation of data sets into collections and files. Specific data sets can be associated with the appropriate experimental parameters, and details of the files holding the actual data, including their location for linking. This provides a detailed description of the study, although not all information captured in specific metadata schemas would be used to search for this data or distinguish one data set from another.

The core entities of the CSMD for a study are summarised as follows.

The Metadata Structure

The metadata within the general structure is laid out in a series of classes and subclasses. We do not describe the whole model in detail for reasons of space, but rather select some areas of particular interest.

Modelling Scientific Activity

The data model describes scientific activities at different levels: the main unit is the Study, which optionally can lie within the context of a science research programme, governed by policies. Each study has an Investigator that describes who is undertaking the activity, and the Study Information that captures the details of this particular study. Studies include particular scientific investigations.

Each investigation has a particular purpose and uses a particular set up of instruments or computer systems.

Classes within the model have several fields. For example, an Investigator has a name, address, status, institution and role within the study. For reasons of space we do not provide a complete description of all the available classes within the metadata model. For illustration, we consider the Study class, whose fields are listed in the table below.

Field                Description
ID                   The key of the Study.
NAME                 Unique name given to the study.
PURPOSE              Description of purpose of study, an abstract of why these investigations are brought together.
STATUS               Ongoing or complete, as there could be additional investigations planned in the future which could be applicable to this study.
RELATED MATERIAL     Information related to the study. This could be related studies in other facilities, or on similar samples.
STUDY_CREATION_DATE  When the study was created.
STUDY_MANAGER        The user who has created the study; this may not be the investigator, but rather a member of the facilities staff.
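
As an illustration only, the Study fields listed above could be held in a simple record type. The following Python sketch is not part of the CSMD or ICAT; the class names, the Python types chosen for each field and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List


class StudyStatus(Enum):
    """The two states named in the STATUS field description."""
    ONGOING = "ongoing"
    COMPLETE = "complete"


@dataclass
class Study:
    """Hypothetical record mirroring the Study fields in the table."""
    id: int                      # ID: the key of the study
    name: str                    # NAME: unique name given to the study
    purpose: str                 # PURPOSE: why the investigations belong together
    status: StudyStatus          # STATUS: ongoing or complete
    study_creation_date: date    # STUDY_CREATION_DATE
    study_manager: str           # STUDY_MANAGER: may be facility staff
    related_material: List[str] = field(default_factory=list)  # RELATED MATERIAL


# Example values are invented for illustration.
study = Study(
    id=1,
    name="example-study",
    purpose="Structural analysis of a sample under varying temperature",
    status=StudyStatus.ONGOING,
    study_creation_date=date(2012, 1, 1),
    study_manager="facility-staff-member",
)
```

Representing STATUS as an enumeration rather than free text is a design choice of this sketch; it keeps the two permitted values explicit.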

Further links in the study relate to the specific investigation. An investigation has fields for the investigators involved, together with their role and their contact details, and also references to the facility and instrument used to capture the data.
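
The link from an investigation to its investigators, facility and instrument can be sketched in the same way. This is a minimal illustration under assumed names: the field names follow the prose above, but the classes, the helper method and the example values are hypothetical and do not reproduce the ICAT schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Investigator:
    """Fields follow the description of the investigator class in the text."""
    name: str
    address: str
    status: str
    institution: str
    role: str  # role within the study, e.g. "principal investigator"


@dataclass
class Investigation:
    """Hypothetical investigation record referring to facility and instrument."""
    title: str
    purpose: str
    facility: str
    instrument: str
    investigators: List[Investigator] = field(default_factory=list)

    def names_with_role(self, role: str) -> List[str]:
        """Return the names of all investigators holding the given role."""
        return [person.name for person in self.investigators if person.role == role]


# Example values are invented for illustration.
investigation = Investigation(
    title="example-investigation",
    purpose="Measure diffraction pattern of a powder sample",
    facility="example-facility",
    instrument="example-instrument",
    investigators=[
        Investigator("A. Scientist", "Example Street 1", "staff",
                     "Example University", "principal investigator"),
        Investigator("B. Student", "Example Street 2", "student",
                     "Example University", "experimenter"),
    ],
)
```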

Modelling scientific data holdings

Investigations are characterised by the generation of a particular set of data on the analysis of a sample, initially raw data, but then further data representing analysed data. Other data may also be associated with the investigation, such as calibration data. Each data set may have different parameters set. The model of data holdings used in the CSMD needs to accommodate this complexity.

In CSMD each investigation is associated with metadata describing the data holding associated with that investigation. The metadata format given here is designed for use on general scientific data holdings, describing data logically which may be physically moved around. Thus, data holdings have three layers: the experiment, the logical data, and the physical files. Data holdings are considered as hierarchies of data sets. A data set can contain sub-datasets, which can in turn be broken down into individual logical data files. These are generalised in the model as Atomic Data Objects (ADOs), since they may be held not in a file store but, for example, in databases. At each level of granularity, metadata can be provided giving representation information (as in OAIS) at the appropriate level of the data holding. At each stage of the data collection process, data is stored in a set of physical files with a physical location.
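
The hierarchy of data sets, sub-datasets and ADOs described above can be sketched as a recursive structure. The class and method names below are hypothetical and are used only to illustrate the logical layer; physical file locations are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class AtomicDataObject:
    """A logical data object; it may live in a file or, for example, a database."""
    name: str


@dataclass
class DataSet:
    """A data set holding ADOs directly and, optionally, nested sub-datasets."""
    name: str
    objects: List[AtomicDataObject] = field(default_factory=list)
    subsets: List["DataSet"] = field(default_factory=list)

    def iter_objects(self) -> Iterator[AtomicDataObject]:
        """Yield every ADO in this data set and, depth first, in its sub-datasets."""
        yield from self.objects
        for subset in self.subsets:
            yield from subset.iter_objects()


# Example names are invented for illustration.
holding = DataSet(
    name="run-0001",
    objects=[AtomicDataObject("run-0001.log")],
    subsets=[
        DataSet("detector-a", [AtomicDataObject("a-1.dat"), AtomicDataObject("a-2.dat")]),
        DataSet("calibration", [AtomicDataObject("cal.dat")]),
    ],
)
all_names = [ado.name for ado in holding.iter_objects()]
```

Metadata could be attached at any level of this tree, which is the point of allowing description at each level of granularity.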

It is possible that there may be different versions of the data sets in the holding. In a general data portal, all stages of the process should be stored and made available, as reviewers of the data holdings may wish to determine the nature of the analysis performed, and other scientists may wish to use the raw data to perform different analyses. Thus type markers ('raw', 'intermediate', 'final') need to be kept with data sets and ADOs, and the relationships between them recorded.
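
One way to picture the type markers and the recorded relationships is a data set that carries its stage and a link to the data set it was derived from. This is a sketch under assumptions: the `derived_from` link and the `provenance` helper are hypothetical names, and the model itself may record relationships differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    """The three type markers named in the text."""
    RAW = "raw"
    INTERMEDIATE = "intermediate"
    FINAL = "final"


@dataclass
class StagedDataSet:
    name: str
    stage: Stage
    derived_from: Optional["StagedDataSet"] = None  # assumed relationship field


def provenance(data_set: StagedDataSet) -> List[str]:
    """Walk the derived_from links back to the original data set."""
    chain = [data_set.name]
    while data_set.derived_from is not None:
        data_set = data_set.derived_from
        chain.append(data_set.name)
    return chain


# Example names are invented for illustration.
raw = StagedDataSet("scan-raw", Stage.RAW)
reduced = StagedDataSet("scan-reduced", Stage.INTERMEDIATE, derived_from=raw)
final = StagedDataSet("scan-final", Stage.FINAL, derived_from=reduced)
```

Here `provenance(final)` lists the final, reduced and raw data sets in that order, which is the kind of trail a reviewer would follow to see how a result was produced.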

The model distinguishes between the logical data holding, describing the data objects and their structural hierarchy, and the data location. The data location provides a mapping between the identifiers used in the data definition component of the metadata model and the actual URLs of the files. This can provide facilities for describing mirror locations for the whole structure, and also for individual files.
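
The separation of logical identifiers from physical locations amounts to a lookup from one identifier to one or more URLs. The sketch below assumes a simple in-memory mapping; the class name, method names and the example.org URLs are all hypothetical.

```python
from typing import Dict, List


class DataLocationMap:
    """Maps logical identifiers to the URLs (primary and mirrors) holding the data."""

    def __init__(self) -> None:
        self._urls: Dict[str, List[str]] = {}

    def add_location(self, logical_id: str, url: str) -> None:
        """Register a further physical location for a logical identifier."""
        self._urls.setdefault(logical_id, []).append(url)

    def resolve(self, logical_id: str) -> List[str]:
        """Return all known URLs for the identifier, in registration order."""
        return list(self._urls.get(logical_id, []))


# Example identifiers and URLs are invented for illustration.
locations = DataLocationMap()
locations.add_location("run-0001/a-1.dat", "https://store.example.org/run-0001/a-1.dat")
locations.add_location("run-0001/a-1.dat", "https://mirror.example.org/run-0001/a-1.dat")
urls = locations.resolve("run-0001/a-1.dat")
```

Because the logical description never embeds a URL, files can be moved or mirrored by updating this mapping alone.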

Parameters

Parameters can be associated with data holdings, data sets, or ADOs. The same metadata item is used to represent both experimental conditions and measured items stored as data points in the data collection; the two are distinguished via a parameter type qualifier ('fixed' or 'measured'). Each parameter has a set of fields describing its name (e.g. temperature, pressure), its value (if fixed as an input parameter), the units of measurement used to qualify the data points, the range of values which the parameter can take, and the error margin expected on the value.
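
A parameter record with the qualifier and fields described above might look as follows. The field names, the use of floating point values, and in particular the check that a fixed parameter must carry a value are assumptions of this sketch rather than rules stated by the model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ParameterKind(Enum):
    """The parameter type qualifier named in the text."""
    FIXED = "fixed"
    MEASURED = "measured"


@dataclass
class Parameter:
    name: str                           # e.g. temperature, pressure
    kind: ParameterKind                 # 'fixed' or 'measured'
    units: str                          # units qualifying the data points
    value: Optional[float] = None       # only meaningful for fixed parameters
    range_min: Optional[float] = None   # lower bound of permitted values
    range_max: Optional[float] = None   # upper bound of permitted values
    error: Optional[float] = None       # expected error margin on the value

    def __post_init__(self) -> None:
        # Assumption of this sketch: a fixed input parameter must state its value.
        if self.kind is ParameterKind.FIXED and self.value is None:
            raise ValueError(f"fixed parameter {self.name!r} needs a value")


# Example values are invented for illustration.
temperature = Parameter("temperature", ParameterKind.FIXED, "K", value=295.0, error=0.5)
counts = Parameter("detector_counts", ParameterKind.MEASURED, "counts",
                   range_min=0.0, range_max=1e6)
```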

Reference implementation

The ICAT system for cataloguing facility-generated experimental data has been in development within STFC over several years for use at both the ISIS Facility and the Diamond Light Source. It forms part of an infrastructure supporting data management across the scientific lifecycle, and is now an open-source development project.

The ICAT Infrastructure

An integrated approach has been taken to provide data infrastructure within STFC. A core component is an information catalogue, the ICAT, which collates metadata about the experiment from different stages of the experimental lifecycle by integrating with the different systems supporting each stage, from proposal to publication. Thus systems which could be integrated across the lifecycle would include:

The ICAT collects metadata across the experimental lifecycle as automatically as possible by interacting with a number of other associated systems, almost all of which already exist as part of the operating environment. Core metadata on the experiment context is thus collected from the proposal system, information about parameters from data acquisition, and so on. In this way metadata is propagated efficiently through the system, maintaining accuracy and completeness and removing the need for retyping. There are a number of features which ICAT needs to accommodate to support its user community.

The core component of the ICAT tool set, the ICAT itself, is a database storing the metadata associated with scientific resources. It exposes a well-defined API that provides a uniform interface to experimental data and a mechanism to link all aspects of research from proposal through to publication. This is published as a web-service interface, which allows bindings in any language so that end user applications can interact with the ICAT.

A web-based front end ("TopCat") provides an alternative standard interface allowing browsing and searching of the catalogue and access to the experimental data. It is accessible to users both from within the facility and from their home institution.

The ICAT interfaces to the data storage system, for example to a virtualised file store on a mass-storage system; this can be tailored to other storage systems, and the functionality has been separated out into the ICAT Data Service (IDS). There are also interfaces to the user database and single sign-on systems which control user identification and authentication within the facility. The ICAT can also be linked to other systems which supply it with data, especially the proposal system, which initiates investigations. Further interfaces can also be added: to e-Science services such as high-performance computing or visualisation, to the publications system for cross-linking with publication data, and to software libraries.

Conclusions on the Information Model

The information model which has been developed within the CSMD has proven a robust and tested model for capturing information about experiments and the associated raw data. It is simple and reasonably generic, while allowing sufficient detail to provide users with a manageable search and discovery interface.

References