Architectural Issues in the HENP Grand Challenge Project

 

Summary by Arie Shoshani

 

Introduction

 

At the meeting held at LBNL on June 30th and July 31st, 1997, several issues were raised concerning the architecture of the system that will support efficient analysis of the event data.

 

The main issues were:

 

  1. How would new applications written in C++ interface to the system?
  2. How do we provide an environment that can support legacy analysis programs (written in Fortran and C) as well as future analysis codes written in C++?
  3. How would the physicist/data administrator interact with the system to control the layout of the event data on tape and disk cache to optimize the system’s performance?

 

Principles for the architecture design

 

There was agreement on several principles that guide the architecture design:

 

  1. There needs to be an exploration phase, before the analysis program is launched, in which the size of the requested micro-dataset is determined. During this phase the analyst expresses a potential query in terms of range conditions on the desired properties (e.g. "all events with 2-3 GeV and more than 5 pions"). A size estimator responds with an estimate of how much data needs to be cached for this query. In addition, a time estimator provides an estimate of the time needed to transfer the data to disk cache, based on what portions of the data are already in the cache (if any). The analyst can then modify the query if necessary and, when satisfied, launch the analysis program. (A sketch of this interaction appears after this list.)

 

  2. It is unimportant to the analysis program in which order the events are provided. Thus, the cache manager can choose to deliver events to the analysis program in the order most beneficial to the system as a whole. In particular, any event data that is already in cache should be made available to the analysis program immediately.

 

  3. Events should be streamed to the analysis program as soon as they are cached. In this way, even when the system is heavily loaded, the worst-case time to fulfill a request is one full pass over the entire event data on tape. Bruce Gibbard estimated that for STAR the size of the primary data per year is about 23 TB, so a full reading cycle would take about 13 days (at a rate of about 20 MB/s). When the system is heavily loaded it is necessary to stream the available data to the analysis application in order to permit sharing of data during a single cycle. Of course, it is hoped that the system will not be totally saturated, and that the cache manager will be able to optimize for a moderate request load.

 

  4. Because of the potentially long time to get even a single event in a shared system, it is too inefficient to fulfill "spontaneous" requests for data by the analysis program. Thus, the concept of an analysis program examining some data and formulating new queries "on the fly" (e.g. using D-ref structures) is not likely to work efficiently. Rather, the normal operation should be an exploration phase followed by a launching phase in which the requested data is cached. The data is then streamed to the analysis program in a "get next" mode.

 

  5. The architecture should support both "legacy" analysis codes (written in Fortran or C) operating on data stored in a standard format with standard content, and new codes written in C++.
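To make principles 1, 2 and 4 concrete, the following is a minimal, self-contained C++ sketch of the analyst-facing interaction. All names here (RangeCondition, QueryEstimate, the toy in-memory catalog, etc.) are hypothetical and do not correspond to any agreed design; a real size/time estimator would work from an index maintained by the storage manager rather than by scanning per-event summaries.

// Hypothetical sketch of the exploration phase and "get next" mode.
// A toy in-memory catalog stands in for the index kept by the storage
// manager; none of these names are part of the agreed design.

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// A query is a conjunction of range conditions on event properties,
// e.g. energy in [2,3] GeV and more than 5 pions.
struct RangeCondition {
    std::string property;
    double low, high;
};
typedef std::vector<RangeCondition> Query;

// Per-event summary used here for estimation; the real system would
// keep such properties in an index, separate from the full event data.
struct EventSummary {
    long   id;
    double energy_gev;
    int    n_pions;
    double size_mb;    // size of the full event record on tape
    bool   in_cache;   // already on the disk cache?
};

static bool matches(const EventSummary& e, const Query& q) {
    for (std::size_t i = 0; i < q.size(); ++i) {
        // Toy dispatch: only the two example properties are handled.
        double v = (q[i].property == "energy_gev") ? e.energy_gev
                                                   : (double)e.n_pions;
        if (v < q[i].low || v > q[i].high) return false;
    }
    return true;
}

// Exploration phase: size and time estimates before launching.
struct QueryEstimate { double mb_total, mb_cached, seconds_to_cache; };

QueryEstimate estimate(const std::vector<EventSummary>& catalog,
                       const Query& q, double tape_mb_per_s) {
    QueryEstimate est = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < catalog.size(); ++i) {
        if (!matches(catalog[i], q)) continue;
        est.mb_total += catalog[i].size_mb;
        if (catalog[i].in_cache) est.mb_cached += catalog[i].size_mb;
    }
    est.seconds_to_cache = (est.mb_total - est.mb_cached) / tape_mb_per_s;
    return est;
}

int main() {
    std::vector<EventSummary> catalog;
    EventSummary a = {1, 2.4, 7, 1.5, true};   // already cached
    EventSummary b = {2, 2.9, 6, 1.5, false};  // still on tape
    EventSummary c = {3, 1.1, 9, 1.5, false};  // fails the energy cut
    catalog.push_back(a); catalog.push_back(b); catalog.push_back(c);

    Query q;
    RangeCondition energy = {"energy_gev", 2.0, 3.0};
    RangeCondition pions  = {"n_pions", 6.0, 1e9};   // "more than 5 pions"
    q.push_back(energy); q.push_back(pions);

    // Exploration: the analyst inspects the estimate and may narrow
    // the query before launching the analysis program.
    QueryEstimate est = estimate(catalog, q, 20.0 /* MB/s, as above */);
    std::printf("need %.1f MB, %.1f MB already cached, ~%.2f s to stage\n",
                est.mb_total, est.mb_cached, est.seconds_to_cache);

    // "Get next" mode: events already in cache are delivered first
    // (principle 2); the rest arrive as the cache manager stages them.
    for (int pass = 1; pass >= 0; --pass)
        for (std::size_t i = 0; i < catalog.size(); ++i)
            if (matches(catalog[i], q) && catalog[i].in_cache == (pass == 1))
                std::printf("analyze event %ld\n", catalog[i].id);
    return 0;
}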

 

Architecture

 

According to the above principles, the architecture shown in figure 1 was adopted (it is an adaptation of an architecture diagram provided by Bill Johnston). It has the following main components. (i) An application environment that can support a variety of codes (including Fortran and C legacy codes). (ii) A storage manager whose responsibility is to keep track of and manage the data on tape and in the disk cache. (iii) A data mover that is responsible for moving the data from tape to disk cache according to instructions from the storage manager (currently, we assume it is the data mover of HPSS). (iv) A query formatter that interacts with users and programs to provide size and time estimates. (v) A job control module whose responsibility is to look at the job requests from a global point of view and optimize the request stream to the storage manager.
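The division of responsibilities can be summarized as abstract interfaces. The following C++ sketch is purely illustrative; the class names and signatures are assumptions made for this summary, not an agreed API.

// Hypothetical C++ interfaces for the components of figure 1; the
// names and signatures below are illustrative only.

#include <string>
#include <vector>

struct RangeCondition { std::string property; double low, high; };
typedef std::vector<RangeCondition> Query;
struct QueryEstimate  { double mb_total, mb_cached, seconds_to_cache; };

// (iv) Query formatter: gives size and time estimates during the
// exploration phase.
class QueryFormatter {
public:
    virtual ~QueryFormatter() {}
    virtual QueryEstimate explore(const Query& q) = 0;
};

// (iii) Data mover (assumed here to be the HPSS mover): copies a file
// from tape to the disk cache on instruction from the storage manager.
class DataMover {
public:
    virtual ~DataMover() {}
    virtual void stage(const std::string& file,
                       const std::string& cache_path) = 0;
};

// (ii) Storage manager: knows what is on tape and what is in cache,
// and decides which files to stage (using the data mover).
class StorageManager {
public:
    virtual ~StorageManager() {}
    virtual std::vector<std::string> files_for(const Query& q) = 0;
    virtual bool in_cache(const std::string& file) = 0;
    virtual void request_staging(const std::vector<std::string>& files) = 0;
};

// (v) Job control: sees all outstanding requests and orders the
// stream of staging requests sent to the storage manager.
class JobControl {
public:
    virtual ~JobControl() {}
    virtual void submit(const Query& q) = 0;
};

// (i) The application environment (STAF plus the "object presenters")
// sits on top of these interfaces and consumes events in "get next" mode.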

Below is a more detailed discussion of each of these components and their functional requirements.

 

  1. The application environment

 

This is the environment where analysis programs are launched. Since this environment needs to support legacy codes as well as new C++ codes, there are several "versions" for the support of the different programs, referred to as "object presenter 1", "object presenter 2", etc. Currently, we plan to use and extend the STAF facility.

 

2. The storage manager

 

The storage manager has two parts: the data loading part and the data access part.

 

The data loading part includes cluster analysis and data reorganization. The cluster analysis module's responsibility is to identify the best way to cluster the event data. The data reorganization module is responsible for restructuring the event data according to the desired clustering.
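As a toy illustration of what the data reorganization step might do (the property chosen for clustering and the bin width are assumptions here, and would in practice be outputs of the cluster analysis module), events can be binned on a clustering property so that a range query on that property touches only a few files:

#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

struct EventKey { long id; double cluster_property; };   // e.g. energy

// Assign each event to a file number by binning the clustering
// property; events likely to be requested together end up together.
std::map<int, std::vector<long> >
assign_to_files(const std::vector<EventKey>& events, double bin_width) {
    std::map<int, std::vector<long> > files;   // file number -> event ids
    for (std::size_t i = 0; i < events.size(); ++i) {
        int file = (int)std::floor(events[i].cluster_property / bin_width);
        files[file].push_back(events[i].id);
    }
    return files;
}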

 

The data access part is the one shown in figure 1, and it includes the exploration module and cache management. The exploration module provides size and time estimates for a given query. The cache management module determines which files to move from tape to the disk cache and in what order. The exploration module interacts with the query formatter, and the cache management module interacts with the job control module.
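One possible cache-management policy, shown here only to illustrate the module's job (the project has not committed to this or any particular policy), is to serve files already in cache first and then stage the remaining files in order of how many pending queries want them, so that a single tape pass is shared by as many requests as possible:

#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::string> FileList;

// Order in which uncached files should be staged from tape, given the
// file lists of all pending queries and the set of files already cached.
FileList staging_order(const std::vector<FileList>& pending_queries,
                       const std::set<std::string>& already_cached) {
    // Count how many pending queries reference each uncached file.
    std::map<std::string, int> demand;
    for (std::size_t q = 0; q < pending_queries.size(); ++q)
        for (std::size_t f = 0; f < pending_queries[q].size(); ++f) {
            const std::string& file = pending_queries[q][f];
            if (already_cached.count(file) == 0) demand[file]++;
        }

    // Rank uncached files by descending demand (ties by file name).
    std::vector<std::pair<int, std::string> > ranked;
    for (std::map<std::string, int>::iterator it = demand.begin();
         it != demand.end(); ++it)
        ranked.push_back(std::make_pair(-it->second, it->first));
    std::sort(ranked.begin(), ranked.end());

    FileList order;
    for (std::size_t i = 0; i < ranked.size(); ++i)
        order.push_back(ranked[i].second);
    return order;   // files already in cache need no staging at all
}

Files that are already in cache are delivered to the requesting programs immediately, in line with principle 2.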

 

Hardware scenarios

 

We are proceeding with two scenarios for where the disk cache should reside. In the first scenario the disk cache is a RAID under HPSS; 200 GB of dedicated RAID will be provided for the GC experimental prototypes. In the second scenario the cache is external (specifically DPSS), and the data mover under HPSS will be requested to stream the data to that cache.

 

The reason for the two scenarios is to learn the cost and efficiency tradeoffs between running a RAID under HPSS (providing about 50 MB/s per channel) and using an external distributed parallel cache that provides full control over the placement of the data.

 

 

 

Open issues

 

  1. Data formats

 

An open problem is the issue of data formats. The traditional legacy codes use a table format to represent the event and particle data. On the other hand, if one chooses an object-oriented database system (OODBMS) to manage the data, then the data needs to be loaded and stored by that system in its internal data formats (i.e. persistent C++ structures). These data formats will not work with the legacy software. Similarly, if the data is stored in traditional table formats, it cannot be readily used by C++ programs. Since the data is too voluminous to store in both formats, it is necessary to choose a primary format and convert to the secondary format as needed by the analysis program.

 

The approach taken by STAR leans towards storing the data in table formats as the primary format, so that legacy software (currently about half a million lines of code in STAF) continues to work. For new C++ codes, the data will be loaded from the table formats into C++ structures. The C++ analysis programs can then interact with an object-oriented database system to store the data they generate.
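A minimal sketch of this direction follows. The column layout is invented for illustration (event_id, energy_gev and n_pions are placeholders, not the actual STAF table schema): the primary copy stays tabular for the legacy codes, and a thin adapter builds C++ objects from table rows for new analysis code.

#include <cstddef>
#include <vector>

// One row of a hypothetical event-summary table, kept as plain data so
// that legacy Fortran/C code can continue to read the same tables.
struct EventRow {
    long   event_id;
    double energy_gev;
    int    n_pions;
};

// The C++ view used by new analysis codes; instances of such classes
// could later be made persistent through an OODBMS if one is adopted.
class Event {
public:
    explicit Event(const EventRow& r)
        : id_(r.event_id), energy_(r.energy_gev), pions_(r.n_pions) {}
    long   id()     const { return id_; }
    double energy() const { return energy_; }
    int    pions()  const { return pions_; }
private:
    long   id_;
    double energy_;
    int    pions_;
};

// Convert a whole table (e.g. one staged file) into C++ objects.
std::vector<Event> from_table(const EventRow* rows, std::size_t n) {
    std::vector<Event> events;
    events.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        events.push_back(Event(rows[i]));
    return events;
}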

 

  2. Use of an OODBMS to manage storage

 

There is much interest in potentially using an OODBMS to store and manage the data on tertiary storage. The attraction is that C++ codes can interact with such a system directly and have the flexibility to organize the data (including clustering of objects) in the way that is best for the analysis codes. In particular, the OODBMS Objectivity was chosen to be interfaced to HPSS, an approach that BABAR and other projects are pursuing.

 

There are several open issues with this approach. Essentially, it is necessary to have Objectivity perform all the functions currently planned for the storage manager. Below are the specifics.

 

  1. So far, there are only plans for interfacing Objectivity to HPSS at the file level. That is, Objectivity will identify whether a file is on tape, and if so will cache it to disk. There are no plans to provide for the organization and clustering of the data on tapes to optimize usage. Thus, this function of the storage manager will have to be provided by the system. In preliminary discussions, it was suggested that perhaps this could be provided by using the "container" capability in Objectivity.
  2. Objectivity will have to provide an efficient index to support the "exploration phase", where estimates of the size of the data as well as the time to cache it are provided (a sketch of the kind of index needed appears after this list).
  3. The number of object IDs necessary for this application is on the order of 10^11. Currently, Objectivity cannot support that.
  4. Since the data will be managed by Objectivity, it is necessary to load all the data into it. There are no estimates at this time of the cost of loading TBs of data from tabular formats into Objectivity.
  5. Objectivity will have to provide cache management. It will have to either develop this capability or incorporate methods that optimize access to event data.
  6. Data stored in Objectivity will have to be converted to table formats for legacy codes to work. It is unknown how expensive this process is.
  7. The cache supported by Objectivity will have to be highly parallel. It could be managed either by HPSS or directly by Objectivity outside of HPSS. In the latter case, it is still open whether Objectivity will provide moving of data to an external disk cache (such as DPSS), and it is not resolved how Objectivity would interact with an external disk cache to run the C++ analysis programs. An alternative is for Objectivity to use a parallel disk cache which it controls; whether this is feasible is also an open question at this time.
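Regarding item 2 above, the following sketch shows the kind of index the exploration phase needs, whether it is provided by Objectivity or built outside it. The structure and names are assumptions made for illustration: per-property histograms of event counts and data volumes, from which an upper bound on the size of a range query can be computed without touching tape.

#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One histogram bin: value range, event count, and total data volume.
struct Bin { double low, high; long n_events; double megabytes; };

// Property name -> histogram of that property over all events.
typedef std::map<std::string, std::vector<Bin> > PropertyIndex;

struct RangeCondition { std::string property; double low, high; };

// Upper bound on the data volume selected by a conjunction of range
// conditions: each condition gives its own bound, and the conjunction
// can select no more than the smallest of them.
double size_upper_bound_mb(const PropertyIndex& index,
                           const std::vector<RangeCondition>& query) {
    double best = -1.0;   // -1.0 means no indexed property constrained the query
    for (std::size_t c = 0; c < query.size(); ++c) {
        PropertyIndex::const_iterator it = index.find(query[c].property);
        if (it == index.end()) continue;    // unindexed property: no bound
        double mb = 0.0;
        for (std::size_t b = 0; b < it->second.size(); ++b) {
            const Bin& bin = it->second[b];
            if (bin.high >= query[c].low && bin.low <= query[c].high)
                mb += bin.megabytes;        // bin overlaps the query range
        }
        best = (best < 0.0) ? mb : std::min(best, mb);
    }
    return best;
}

A time estimate would then combine such a size bound with the tape transfer rate and with knowledge of which portions of the data are already in the disk cache.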

Change log:

14 Oct 1997, remove a reference to add more material. D.O.