
How to: build the index



It is now easier to build the Storage Manager index (I hope). There is only
one script to run, although you have to create a file with names of components
and their paths in HPSS. You also need to copy a few other scripts into the
directory in which you are creating your index. All scripts are available
in CVS/QE/index/size. You also need mksize and mkoids in your path; the
easiest way is probably to check out QE, run make all, and add QE/bin to
your path.

 - Henrik

This is from CVS/QE/README:

4 Building the index
====================

* Create a flat ASCII file using TagFlatIndex (by David Zimmerman).

* Create a file called component.hpss.paths with two columns:
  <component name> <path on HPSS>

* Create the directory where you want the index and copy the following
  files there: component.hpss.paths, ftp.csh, assoc.awk, parse_dump.awk

* Run mkindex.csh <fdboot file name> <Tag_Flat_Index.data> in the
  directory you created in the step above

* That's it. You need to let the Query Estimator know where the index
  is (see below).

4.1 Example of building the index
=================================

This is an example of what component.hpss.paths can contain:
dbComp0 /home/rhstar/olson/objy/db_30007
dbComp1 /home/rhstar/olson/objy/db_30007
dbComp2 /home/rhstar/olson/objy/db_30007
dbComp3 /home/rhstar/olson/objy/db_30007
dbComp4 /home/rhstar/olson/objy/db_30007
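
A quick way to catch a malformed component.hpss.paths before running
mkindex.csh is sketched below in POSIX sh. The function name
check_paths_file is hypothetical and not part of QE; the only thing it
assumes is the two-column layout described above.

```shell
# Hypothetical helper, not part of QE: report every line of a
# component.hpss.paths file that does not have exactly two
# whitespace-separated columns (<component name> <path on HPSS>).
# Returns 0 if all lines are well formed, 1 otherwise.
check_paths_file() {
    awk 'NF != 2 { printf "line %d: expected 2 columns, found %d\n", NR, NF; bad = 1 }
         END     { exit (bad ? 1 : 0) }' "$1"
}

# Example use (commented out so the block only defines the function):
# check_paths_file component.hpss.paths && echo "component.hpss.paths looks fine"
```

Note that this only checks the shape of the file; it does not check that
the paths exist on HPSS.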

The following assumes that these commands are in your path (if they
aren't, you need to check out the QE source from cvs and make them):
ascii2vert, mkoids, mkbmp. It also assumes that you are in a directory
containing Tag_Flat_Index.data (the ASCII file generated by
TagFlatIndex) and that you want the index created in ./index. The
output of the commands has been omitted:

mkdir index
cd index
# copy ftp.csh, assoc.awk and parse_dump.awk here
vi component.hpss.paths
mkindex.csh /common/GC/dbases/fd30007/hij.fdboot ../Tag_Flat_Index.data

That's it. Specify the following in gc.config, if the directory that
contains index is /grandch/u/GCA:
qe*IndexDirectory=/grandch/u/GCA/index
qe*DataSetDefFile=/grandch/u/GCA/index/bin_spec.tdc
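
If you would rather not type the paths by hand, the two entries can be
generated from the index directory. This is only a sketch in POSIX sh;
qe_config_lines is a hypothetical name, not a QE tool, and it assumes
that bin_spec.tdc lives directly in the index directory as in the
example above.

```shell
# Hypothetical helper, not part of QE: print the two gc.config entries
# for a given index directory, using its absolute path.
qe_config_lines() {
    dir=$(cd "$1" && pwd) || return 1
    printf 'qe*IndexDirectory=%s\n' "$dir"
    printf 'qe*DataSetDefFile=%s/bin_spec.tdc\n' "$dir"
}

# Example use (commented out so the block only defines the function):
# qe_config_lines ./index >> gc.config
```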

4.2 Details of building the index
=================================

To build the index "by hand", i.e., without using mkindex.csh:

* Run ascii2vert on the ASCII file generated by TagFlatIndex:
ascii2vert Tag_Flat_Index.data <index>
where <index> is the name of the directory where you want to create the
index. ascii2vert produces one p_<prop>.bin file per property, where
<prop> is the name of that property as specified in the first line of
Tag_Flat_Index.data. This is the name that can be used in queries to
the Query Estimator (once you have completed all the steps of building
the index).
Output:
o one p_<prop>.bin per property <prop>
o bin_spec.tdc - a text file describing the properties, their types and
                 the way they are binned
o components.txt - a text file with the names of the components that
                   the Storage Manager will know about
o names.txt - a text file with the names of the properties in the index

* Change directory to <index>

* Run mkoids in that directory. mkoids takes no arguments. It creates
  oids.bin which is a mapping to OIDs in the index.

You now have something that we call a Full Scan Index.
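
Before moving on to mkbmp it can be worth checking that the files
listed above are all there. A minimal sketch in POSIX sh follows; the
function name check_full_scan_index is hypothetical (not part of QE),
and the file list is simply the output described above for ascii2vert
and mkoids.

```shell
# Hypothetical helper, not part of QE: check that a directory contains
# the files that ascii2vert and mkoids are described as producing.
# Prints one line per missing item and returns 1 if anything is missing.
check_full_scan_index() {
    dir=$1
    missing=0
    for f in bin_spec.tdc components.txt names.txt oids.bin; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    # ascii2vert writes one p_<prop>.bin per property; expect at least one.
    set -- "$dir"/p_*.bin
    if [ ! -f "$1" ]; then
        echo "missing: no p_<prop>.bin files in $dir"
        missing=1
    fi
    return $missing
}

# Example use (commented out so the block only defines the function):
# check_full_scan_index . && echo "Full Scan Index looks complete"
```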

* Create the Bitmap Index from the Full Scan Index by running mkbmp:
mkbmp <bin_spec.tdc> <index> [maximum number of worker threads]
where <bin_spec.tdc> is the name of the bin_spec.tdc file that was
generated by ascii2vert and <index> is the directory containing the
Full Scan Index. The optional third argument limits the number of
threads that mkbmp uses. If you do not specify it, make sure you have
a high soft limit on descriptors (type `limit` to find out). For tcsh,
`limit descriptors 1024` will let you have 1024 descriptors (files,
sockets etc.) open at the same time. mkbmp opens one file per bin per
property if you don't limit the number of threads.

Hint: if mkbmp complains that it can't open certain files, either set
the soft limit for descriptors to a higher value or limit the number 
of threads to 10 or so, by specifying the third argument to mkbmp.
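
The `limit` command above is a csh/tcsh builtin. In sh-style shells the
corresponding builtin is `ulimit -n`. The sketch below (POSIX sh, with
the -n option as supported by bash, dash and ksh) checks the soft limit
against a number you supply; fits_descriptor_limit is a hypothetical
name, and how many descriptors mkbmp really needs depends on your
number of bins and properties, which this sketch does not try to work
out.

```shell
# Hypothetical helper, not part of QE: succeed if the soft limit on open
# descriptors is at least the number given as the first argument.
fits_descriptor_limit() {
    needed=$1
    soft=$(ulimit -S -n)
    # Some systems report "unlimited" instead of a number.
    [ "$soft" = "unlimited" ] && return 0
    [ "$soft" -ge "$needed" ]
}

# Example use (commented out so the block only defines the function):
# fits_descriptor_limit 1024 || echo "raise the limit or give mkbmp a thread limit"
```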

mkbmp outputs one file per bin per property. The naming convention is
idx_<prop>bN.bin for bin N of property <prop>.

* Create the file size index:
o cp components.txt component.hpss.paths
o add the HPSS path for each component in component.hpss.paths
o run get_sizes.csh:
  ./get_sizes.csh <fdboot file path>
o for each component, run:
  mksize p_<component name>.bin

Example detailed run:

ascii2vert Tag_Flat_Index.data index
cd index
mkoids
mkbmp bin_spec.tdc . 10
cp components.txt component.hpss.paths
vi component.hpss.paths # add HPSS paths
./get_sizes.csh /disk0/grandch/stardb/fd30025/STAR
mksize p_tracks.bin
mksize p_an_other_component.bin
mksize p_yet_an_other_component.bin
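
With many components, typing the mksize lines by hand gets tedious. The
sketch below (POSIX sh) prints one mksize command per component so you
can look them over before running them. mksize_commands is a
hypothetical name, not part of QE, and it assumes that the component
name is the first column of each line, as in component.hpss.paths.

```shell
# Hypothetical helper, not part of QE: print "mksize p_<component>.bin"
# for every non-empty line of the given file, taking the component name
# from the first column. Nothing is executed; the commands are only printed.
mksize_commands() {
    awk 'NF > 0 { printf "mksize p_%s.bin\n", $1 }' "$1"
}

# Example use (commented out so the block only defines the function):
# mksize_commands component.hpss.paths        # review the commands
# mksize_commands component.hpss.paths | sh   # then run them
```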

 
_________________________________________
Henrik Nordberg       <hnordberg@lbl.gov>
Scientific Data Management Research Group
Lawrence Berkeley National Laboratory