ADAPTER: An Adaptive Data Format for Particles
Project Motivation:
Increasing amounts of data are being produced by computational science simulations and by scientific measurements. This deluge of data is slowing the speed at which new scientific knowledge can be discovered, as scientists struggle to extract new information from massive multi-terabyte data sets. The ADAPTER project investigated how to manage this data deluge for particle data sets generated, for example, by supercomputer simulations of the origins of the Universe in cosmology, or by LIDAR scans of mountain ranges in the Earth sciences.
(a) A particle data set from a cosmological simulation tracing the origins of structure in the Universe from the big-bang to the present day.
This type of data output from supercomputers is often monolithic in structure and does not allow easy access to parts of the data, nor to varying resolutions of the data when the full output is not required. (ICC, Physics, Durham).
(b) A particle data set from a LIDAR scan of the Raz Hanzir sea cliffs in Malta. This is used to analyse fracture structure, for example, to predict the carbon-storing properties of rock.
This type of particle data is scanned rapidly by rotating laser scanners which measure distance to each point while at the same time photographing the surface to give a colour value for each measured point. (ITF FR3DA project, Durham)
Figure 1: Two different applications for particle data. Each particle represents completely different physical data, but the underlying computational problem is identical: how can particles and their associated properties be represented and analysed efficiently?
The ADAPTER project created a new software system that demonstrates how huge data sets can be stored so that scientists can focus on the more important aspects directly relevant to the information they are trying to extract. A key component of this is a new data format based on a "multi-resolution parallel kd-tree". This new format allows much faster access to data than previously possible. By sorting the data and distributing it across a small cluster of computers, a low resolution view of the data can be previewed. Subsequently, those regions of particular interest can be extracted and queried at higher and higher resolution. In this way the scientists can see both the wood and the trees as needed.
ADAPTER enables this investigative process to happen at interactive speeds, unlike existing tools, which can take many tens of minutes to satisfy a single query and/or require a large supercomputer. The project has demonstrated that the ADAPTER design works, and results have been presented to scientists, computer scientists and the general public.
Technical Overview:
The goal of this project was to build a proof-of-concept software tool to help manage the ever-growing size of data sets in science, specifically particle data sets that have attributes attached to each particle and that might evolve over time, as in the cosmology simulation outputs shown in our existing 3D movie "Cosmic Cookery".
Scientists often store their data in self-describing data file formats such as Hierarchical Data Format (HDF), Network Common Data Format (NetCDF), Planetary Data System (PDS) and Flexible Image Transport System (FITS). These file formats improve the storage and retrieval of large multi-dimensional arrays. However, they do not support multi-dimensional spatial indexing or semantic indexing, for example querying for all particles with attribute X within a specified range. Another growing issue is the lack of support for multi-resolution indexing, storage and retrieval functions. As the size of the latest scientific datasets is growing rapidly, multi-resolution support becomes ever more desirable. A number of projects have addressed these problems individually; however, as yet there is no solution for a multi-resolution data format with spatial indexing features for unstructured particle data. The ADAPTER solution is to use a multi-resolution kd-tree as a spatial index into the data, as shown below:
In order to achieve multi-resolution indexing we repeat a process of sub-sampling, division and indexing of the dataset. Initially, the entire dataset is coarsely sampled and indexed. Then each leaf node in this index represents a region or volume (assuming spatial indexing) of the entire dataset; the remaining non-indexed data is then divided up into these regions, sub-sampled and indexed again. The leaf nodes from the first index are then linked to the appropriate smaller region indices. These steps are repeated until no further data is left to be indexed, which depends on the sampling rate at each stage. Larger datasets require more levels of sub-sampling and indexing so that there is a sufficient range of multi-resolution access.
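The repeated sample, divide and index steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ADAPTER implementation: the function names, the nested-dictionary layout, the half-open region boxes and the `min_sample` floor on the sample size are all hypothetical choices made for this sketch.

```python
import random

def contains(box, p):
    """True if point p lies inside the half-open box [(lo, hi), ...]."""
    return all(lo <= x < hi for (lo, hi), x in zip(box, p))

def split_regions(sample, box, leaf_size, depth=0):
    """Median-split the sample kd-tree style; return leaves as (box, points)."""
    if len(sample) <= leaf_size:
        return [(box, sample)]
    axis = depth % len(box)
    cut = sorted(p[axis] for p in sample)[len(sample) // 2]
    lo = [p for p in sample if p[axis] < cut]
    hi = [p for p in sample if p[axis] >= cut]
    if not lo:  # every coordinate on this axis is identical: stop splitting
        return [(box, sample)]
    lo_box = [(b[0], cut) if i == axis else b for i, b in enumerate(box)]
    hi_box = [(cut, b[1]) if i == axis else b for i, b in enumerate(box)]
    return (split_regions(lo, lo_box, leaf_size, depth + 1)
            + split_regions(hi, hi_box, leaf_size, depth + 1))

def build_multires(points, box, rate, leaf_size, min_sample, rng):
    """Index a coarse sample, then recurse on the remaining data per leaf region.

    Assumes every point lies inside `box`. Sampling at least `min_sample`
    points per level is an assumption of this sketch to keep the level
    count small; small regions are indexed in full.
    """
    if not points:
        return []
    k = min(len(points), max(min_sample, int(len(points) * rate)))
    chosen = set(rng.sample(range(len(points)), k))
    sample = [p for i, p in enumerate(points) if i in chosen]
    rest = [p for i, p in enumerate(points) if i not in chosen]
    leaves = []
    for region, pts in split_regions(sample, box, leaf_size):
        # Non-indexed data is divided among the leaf regions, then each
        # region is sub-sampled and indexed again at the next level.
        inside = [p for p in rest if contains(region, p)]
        finer = build_multires(inside, region, rate, leaf_size, min_sample, rng)
        leaves.append({"bounds": region, "sample": pts, "finer": finer})
    return leaves

def walk(leaves):
    """Yield every leaf at every resolution level."""
    for leaf in leaves:
        yield leaf
        yield from walk(leaf["finer"])

rng = random.Random(0)
particles = [(rng.random(), rng.random()) for _ in range(2000)]
index = build_multires(particles, [(0.0, 1.0), (0.0, 1.0)],
                       rate=0.1, leaf_size=16, min_sample=32, rng=rng)
```

In this sketch every particle is stored exactly once, at the level where it was sampled, and each leaf links to the finer index covering its own region.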
The advantage of this approach over simply building a single large kd-tree index for the entire dataset is that data from a region of interest can be accessed with a minimum of pruning, clipping and I/O. With a single large index, accessing a region of interest at a coarse resolution means that the data from the intersecting leaf nodes has to be read from disk at full resolution, clipped against the boundaries of the region of interest, and then pruned or sub-sampled. This leads to a lot of needless I/O and computation, especially when the desired resolution is very coarse.
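To make the I/O argument concrete, the following sketch queries a tiny hand-built two-level index and counts how many stored particles are read. The nested-dictionary layout (`bounds`, `sample`, `finer`), the function names and the data values are hypothetical and chosen only for illustration; they are not the ADAPTER file format or API.

```python
def intersects(a, b):
    """True if two half-open boxes [(lo, hi), ...] overlap."""
    return all(alo < bhi and blo < ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))

def inside(box, p):
    """True if point p lies inside the half-open box."""
    return all(lo <= x < hi for (lo, hi), x in zip(box, p))

def query(leaves, box, max_level, level=0):
    """Return (hits, particles_read) for `box`, descending at most `max_level` levels."""
    hits, read = [], 0
    for leaf in leaves:
        if not intersects(leaf["bounds"], box):
            continue  # pruned: nothing is read for this region
        read += len(leaf["sample"])
        hits += [p for p in leaf["sample"] if inside(box, p)]  # clip
        if level < max_level:
            finer_hits, finer_read = query(leaf["finer"], box, max_level, level + 1)
            hits += finer_hits
            read += finer_read
    return hits, read

# A two-level index over the unit square (made-up values).
index = [
    {"bounds": [(0.0, 0.5), (0.0, 1.0)], "sample": [(0.2, 0.3)], "finer": [
        {"bounds": [(0.0, 0.5), (0.0, 0.5)],
         "sample": [(0.1, 0.1), (0.4, 0.2)], "finer": []},
        {"bounds": [(0.0, 0.5), (0.5, 1.0)],
         "sample": [(0.3, 0.8)], "finer": []},
    ]},
    {"bounds": [(0.5, 1.0), (0.0, 1.0)], "sample": [(0.7, 0.6)], "finer": [
        {"bounds": [(0.5, 1.0), (0.0, 1.0)],
         "sample": [(0.9, 0.1), (0.6, 0.9)], "finer": []},
    ]},
]

region = [(0.0, 0.5), (0.0, 0.5)]
coarse, coarse_read = query(index, region, max_level=0)
fine, fine_read = query(index, region, max_level=1)
```

The coarse query touches only the level-0 sample of the one intersecting leaf, while the finer query additionally reads the one finer leaf that overlaps the region. Nothing is read at full resolution and then thrown away.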
The subsampled data is randomly distributed across as many CPU nodes as are available, while the index, which is designed to be very small, is replicated on every CPU node. This allows rapid selection of subsets of the data (e.g. spatially) and allows the selection of data at different levels of resolution. The ADAPTER API provides access to query the data once it is distributed.
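The effect of random distribution can be illustrated with a small simulation. This is a sketch under assumptions, not the ADAPTER API: plain Python lists stand in for the CPU nodes (the real software uses MPI), the names `distribute`, `local_select` and `parallel_select` are hypothetical, and a simple box test stands in for a lookup in the replicated index.

```python
import random

def in_box(box, p):
    """True if point p lies inside the half-open box [(lo, hi), ...]."""
    return all(lo <= x < hi for (lo, hi), x in zip(box, p))

def distribute(points, n_ranks, rng):
    """Assign each particle to a randomly chosen rank (simulated CPU node)."""
    shares = [[] for _ in range(n_ranks)]
    for p in points:
        shares[rng.randrange(n_ranks)].append(p)
    return shares

def local_select(share, box):
    """The work one rank does: filter only the particles it holds."""
    return [p for p in share if in_box(box, p)]

def parallel_select(shares, box):
    """Run the selection on every rank and gather the results.

    On a real cluster the per-rank calls would run concurrently; here they
    run one after another.
    """
    per_rank = [local_select(share, box) for share in shares]
    gathered = [p for part in per_rank for p in part]
    return gathered, [len(part) for part in per_rank]

rng = random.Random(1)
sampled = [(rng.random(), rng.random()) for _ in range(1000)]
shares = distribute(sampled, n_ranks=4, rng=rng)
box = [(0.2, 0.6), (0.2, 0.6)]
hits, per_rank = parallel_select(shares, box)
```

Because the assignment ignores particle position, any spatial selection tends to be spread across all ranks rather than landing on a single node, which is one plausible reason for distributing the data randomly.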
Further details of the design and implementation are available in the documents below.
Reports, Presentations and Publications:
Initial design report in pdf. (March 2008)
Detailed kd-tree design diagram in pdf. (August 2008)
Parallel kd-tree design overview in pdf. (September 2008)
A presentation and demonstration of ADAPTER was given to cosmologists at the international VIRGO consortium workshop held at the Max-Planck-Institut für Astrophysik in Garching on 28th January 2009. Slides are here in PDF.
A presentation and demonstration of ADAPTER was given to computer scientists at the Durham Interactive Media Technology group meeting on 19th May 2009.
Software Availability:
The ADAPTER software is freely available for reuse. However, please note that it requires technical expertise to install, compile and run, as it is built using the MPI distributed processing libraries. Please email the PI below for details of how to receive the software.
Administrative Arrangements:
The ADAPTER project ran from October 2007 until March 2009 and was funded by the EPSRC as grant number EPSRC EP F01094X. The Principal Investigator was Dr N.S. Holliman; the Co-Investigators were Dr A. Jenkins and Dr T. Theuns.
The Research Associate who was employed for the duration of the project to undertake the research was Djamel Hassaine.