
all tests setup - revised




This version contains the 3 additional tests performed on Star on Oct 5th.

Below is a summary of the tests we ran at BNL.  Each test has a summary
of the setup followed by comments about the outcome.  The comments
include some preliminary observations that need to be confirmed by
the logs.

All the logs, queries and other info collected for each run are in 
directories /grandch/u/GCA/bin/RUN_TEST/results and
/grandch/u/GCA/bin/phenix/results, for star and phenix
respectively.

The setup summary uses the following notation:

Fed: means which federation the test was run on, star or phenix.

SII/UC/STAF: means whether the test was run on the User Code simulator (SII),
on the user code (UC), which includes the event iterators and Objectivity,
or as an analysis program on STAF (STAF).

Cache size: is the test cache size selected to reflect a scale down
in the ratio of the expected real data to the amount of simulation
data available at the time of the test.

Query: specifies which query was run; the query is found in the
corresponding results directory.

Proc.time: is the time we set for processing each event for each test.  
It was used to simulate different analysis complexity (e.g. 20-30 sec
for event-by-event).

Policy: yes - we use the current caching policy of the Query Monitor,
which coordinates the use of files among queries.
No - we use the non-coordinating setup, requesting files in the
order extracted from the index.

Cycle: means how many times we ran the query (or collection of
queries) in the test.  For example, in TEST 1 below we ran four queries
in a given order and then cycled over them 10 times.  A time delay
was introduced between queries according to the specification in the
query file.
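
For illustration only, the notation above can be written down as a small
record.  The field names and the Python form are assumptions made for this
sketch; they are not taken from the test scripts.

```python
from dataclasses import dataclass


@dataclass
class TestSetup:
    """One test's setup, mirroring the notation above (names are hypothetical)."""
    fed: str             # federation: "star" or "phenix"
    mode: str            # "SII", "UC" or "STAF"
    cache_size_gb: float  # test cache size in GB
    query: str           # query file name in the results directory
    proc_time_s: float   # simulated processing time per event, in seconds
    policy: bool         # True = Query Monitor coordinates caching
    cycle: int           # how many times the query (or set of queries) is run


# TEST 1 as listed below: four queries, cycled 10 times, with coordination.
test1 = TestSetup("star", "UC", 1.0, "queres.all4", 0.01, True, 10)
print(test1)
```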

TEST 1
--------------------
Fed: star
SII/UC/STAF: UC
Cache_size: 1 GB
Query: queres.all4 (four queries)
Proc.time: 0.01 (except one)
Policy: yes
Cycle: 10
Comments:
1. This run was run overnight as a robustness
   test for entire system including UC.  It ran 
   for 10 hrs, and completed.
2. Note: Yuri ran his jobs during the last 3 hours.

TEST 2
--------------------
Fed: star
SII/UC/STAF: SII
Cache_size: 2 GB
Query: query.q1.twice
Proc.time: 0.01 
Policy: no
Cycle: 1
Comments:
1. This test was an attempt to run the same query
   twice with a 30 min time delay without caching
   coordination.  The purpose was to show that, without
   coordination, many files are read twice.
2. This is when we noticed that the time we were
   observing included the wait until a drive was
   available.  So we decided to serialize the PFTP
   requests in the next test.

TEST 3
--------------------
Fed: star
SII/UC/STAF: SII
Cache_size: 1 GB
Query: query.q1, delay, then query.q1
Proc.time: 0.01 
Policy: no
Cycle: 1
Comments:
1. This test was a second attempt to run the same query
   twice with a time delay without caching
   coordination.
2. We ran this test with the PFTPs scheduled serially.
   This is when we noticed that the tape was dismounted and
   remounted for every file, even when the file was on the same
   tape.  We later learned that this is because the dismount
   time on HPSS was set to 15 sec.  So serializing PFTPs was
   bad: with no pending PFTP from the same tape, the tape
   was dismounted unnecessarily.
3. We decided to abandon serialization of PFTPs.
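
The dismount effect can be sketched with a toy model.  It assumes a single
drive that dismounts its tape once it has been idle longer than the dismount
time, and it describes each request only by its tape and by how long the
drive sat idle before the request arrived.  The 40 sec gap and the tape name
are made-up values for illustration, not measurements from the logs.

```python
def count_mounts(requests, dismount_time_s=15.0):
    """Count tape mounts for a list of (tape_id, idle_gap_s) requests.

    idle_gap_s is how long the drive was idle before the request arrived.
    """
    mounts = 0
    mounted = None  # tape currently in the drive, if any
    for tape, idle_gap_s in requests:
        if mounted is not None and idle_gap_s > dismount_time_s:
            mounted = None  # drive timed out and dismounted the tape
        if tape != mounted:
            mounts += 1     # tape has to be (re)mounted for this file
            mounted = tape
    return mounts


# Serialized PFTPs: the next request arrives only after a long gap.
serialized = [("T1", 0.0), ("T1", 40.0), ("T1", 40.0), ("T1", 40.0)]
# Several PFTPs pending from the same tape: the drive never goes idle.
pending = [("T1", 0.0), ("T1", 0.0), ("T1", 0.0), ("T1", 0.0)]

print(count_mounts(serialized), count_mounts(pending))
```

In this model the serialized case mounts the tape once per file, while the
case with requests already pending mounts it once in total.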

TEST 4
--------------------
Fed: star
SII/UC/STAF: SII
Cache_size: 1 GB
Query: query.cluster
Proc.time: 10
Policy: yes
Cycle: 1
Comments:
1. This test was made to show the benefits of clustering.
   We made a query that selects the same number of events
   as query1, but from only 4 files.  The benefit was a
   severalfold speedup, even though the processing time per
   event is large (10 sec), which makes the processing time
   dominate the caching time.
2. We noticed that even with one PFTP pending (i.e. one cache
   ahead request) we still got tapes dismounted unnecessarily
   even if the next file was on the same tape.  We think this 
   happens if the transfer time between the hpss cache and local
   cache is longer than transfer time of the next PFTP file to
   hpss cache.  We verified that if we have many (more than 2)
   PFTPs pending from the same tape, the tape does not dismount.
   

TEST 5
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 1 GB
Query: query6 (six queries)
Proc.time: 20 (except one)
Policy: yes
Cycle: 1
Comments:
1. This run was run overnight as a robustness
   test.  All 6 queries provided by Dave Z.
   completed.

TEST 6
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 1 GB
Query: query7
Proc.time: 20 sec
Policy: yes
Cycle: 1
Comments:
1. This run is the first part of testing the
   effect of clustering.  This run is not clustered.
2. HPSS_cache did not empty - this test is not useful
   for measurement, but it shows that system ran OK.
3. Query7 is a modified Query2 to have more 
   events (131 instead of 66)
4. We asked for purge policy to be changed to:
   purge within 5 min, when cache is 1% full (to guarantee
   that hpss_cache empties.)

TEST 7
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 1 GB
Query: query7
Proc.time: 20 sec
Policy: yes
Cycle: 1
Comments:
1. This run is the first part of testing the
   effect of clustering.  This run is not clustered.
   This test is valid.
2. This is the same as the previous run, but 
   this time hpss_cache was properly emptied.
3. Query7 is a modified Query2 to have more 
   events (131 instead of 66)
4. Note: This test should be dominated by caching
   time, even though proc_time is large, because 
   there are few relevant events per file.
5. Note: Total time was about 2 hrs (check)
6. Note: We observed tapes 43 and 44 switching
   several times, inducing long delays,
   e.g. a switch between files 51 and 172, then
   again on the next file. Also at 11:31 a switch to 44,
   and at 11:33 a switch back to 43.
7. Note: 2 files were missing, 103 and another (see log).

TEST 8
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 1 GB
Query: query8
Proc.time: 20 sec
Policy: yes
Cycle: 1
Comments:
1. This run is the second part of testing the
   effect of clustering.  This run is clustered.
   This test is valid.
2. The query used here was selected to have the
   same number of events as the previous query
   (131 events), but concentrated in 4 files
   (see queries.stat for distribution).
3. Note: This test should be dominated by processing
   time, because there are about 33 events per file
   on the average, and processing time is 20 sec/event
   (or about 660 sec per file).
4. Note: Total time was about 40 min (check log),
   as opposed to 2 hours for the unclustered case.

TEST 9
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 1 GB
Query: query8
Proc.time: 1 sec
Policy: yes
Cycle: 1
Comments:
1. This run is the third part of testing the
   effect of clustering.  This run is clustered,
   but the processing time was set to 1 second.
   This test is valid.
2. As in the previous run,
   the query used here was selected to have the
   same number of events as query7
   (131 events), but concentrated in 4 files
   (see queries.stat for distribution).
3. Note: This test should be dominated by cache
   time, because there are about 33 events per file
   on the average, but processing time is only 1 
   sec/event (or about 33 sec per file). 
4. Note: Total time was about 15 min (check log)
   as opposed to 40 min for the previous run 
   (with proc. time 20 sec) and as opposed to
   2 hours for the unclustered case.
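
A rough check of the three clustering runs (TEST 7, 8 and 9) is sketched
below.  It assumes events are processed one at a time, so that pure
processing time is simply events x proc. time.  The totals are the
approximate figures quoted in the comments and still need to be confirmed
by the logs.

```python
EVENTS = 131  # events selected by both query7 and query8


def processing_minutes(proc_time_s, events=EVENTS):
    """Pure processing time in minutes, assuming serial event processing."""
    return events * proc_time_s / 60.0


# test name -> (proc. time in sec/event, approximate total time in minutes)
runs = {
    "TEST 7 (unclustered)": (20.0, 120.0),
    "TEST 8 (clustered)": (20.0, 40.0),
    "TEST 9 (clustered)": (1.0, 15.0),
}

for name, (proc_time_s, total_min) in runs.items():
    proc_min = processing_minutes(proc_time_s)
    share = proc_min / total_min
    print(f"{name}: processing ~{proc_min:.1f} of ~{total_min:.0f} min ({share:.0%})")
```

Under this assumption the 20 sec runs need roughly 44 min of processing.
That is about the whole of the quoted total for TEST 8 and only about a
third of it for TEST 7, while for TEST 9 processing is a small fraction of
the total.  This is consistent with the notes on which time dominates.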

TEST 10
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 250 MB
Query: query8
Proc.time: 1 sec
Policy: no
Cycle: 1
Comments:
1. This test was not run to completion.
2. This run was supposed to be the first part of
   a test to show the value of caching coordination.
   It was set up as the same query run twice with a
   delay of 10 min.  The idea is that by the time the
   second query starts the first file is removed from 
   the cache.  Since there is no policy, the files 
   will not be synchronized, and will be cached twice.
3. This test got as far as asking to cache the first 
   file a second time, and then the QM got stuck because
   of a race condition that was not anticipated.
4. The second part of this test was not conducted.
   Alex was contacted, and he sent a fix.  The plan is
   to run this test again on Oct 5th (tomorrow),
   when we switch over to STAR.

TEST 11
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 2 GB
Query: query10
Proc.time: 0.01 sec
Policy: yes
Cycle: 1
Comments:
1. The purpose of this test was to check the 
   efficiency of accessing objectivity.  It was
   setup as 3 steps: 
   a) Make the cache size large enough to hold
      10 files, and cache them.
      Selected files and sizes were (sizes provided
      by Dave Z. in parentheses):
       98(194), 94(194), 71(182), 57(173), 39(156)
       73(128), 72(148), 58(155), 47(102), 41(119)
      In total the query selected 500 events out of
      these files.
   b) Run a test with SII only, running this query
      10 times, where processing time is very small.
      No caching takes place since all the files were
      left in the cache.  Thus 5000 events are "retrieved".
   c) Run the same test with UC.  This will make objectivity
      get 5000 events from 10 files.
2.  This run is only the cache loading setup.  
    It ran successfully.
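
As a quick consistency check on the setup, the ten file sizes listed in
step a) can be summed and compared with the cache size.  The list does not
state units; the sketch assumes MB, and takes 2 GB as 2000 MB.

```python
# File number -> size as listed in step a); units assumed to be MB.
file_sizes = {
    98: 194, 94: 194, 71: 182, 57: 173, 39: 156,
    73: 128, 72: 148, 58: 155, 47: 102, 41: 119,
}

CACHE_SIZE_MB = 2000  # 2 GB cache, assuming 1 GB = 1000 MB

total_mb = sum(file_sizes.values())
print(f"{len(file_sizes)} files, {total_mb} MB, fits: {total_mb <= CACHE_SIZE_MB}")
```

If the units are indeed MB, all ten files fit in the cache at once, which
is what steps b) and c) rely on.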

TEST 12
--------------------
Fed: Phenix
SII/UC/STAF: SII
Cache_size: 2 GB
Query: query10
Proc.time: 0.01 sec
Policy: yes
Cycle: 10
Comments:
1. This is step b) of the test, running SII - 
   see previous run
2. Processing time was 32 sec (as expected)
   since all the files were in cache.

TEST 13
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 2 GB
Query: query10
Proc.time: 0.01 sec
Policy: yes
Cycle: 10
Comments:
1. This is step c) of the test with UC (i.e. 
   Objectivity) - see previous runs
2. Processing time was 130 sec, with
   all the files already in cache.
3. That is only about 100 sec more than the SII run
   (130 vs. 32 sec) to access 5000 events through
   Objectivity (pretty good!).
4. A more interesting test will be with UC running
   on a linux machine over the net.  Then we will test
   the cost of transfer time over the net.
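
The per-event cost implied by TEST 12 and TEST 13 can be worked out as
below.  This is a sketch that assumes the difference between the UC and
SII times is entirely due to Objectivity access; the variable names are
ours.

```python
events_per_cycle = 500  # events selected by query10
cycles = 10
sii_seconds = 32.0      # TEST 12: SII only, all files in cache
uc_seconds = 130.0      # TEST 13: UC (Objectivity), all files in cache

total_events = events_per_cycle * cycles
extra_seconds = uc_seconds - sii_seconds  # time attributed to Objectivity
ms_per_event = 1000.0 * extra_seconds / total_events

print(f"{total_events} events, {extra_seconds:.0f} sec extra, "
      f"{ms_per_event:.1f} ms per event")
```

Under that assumption the cost comes to roughly 20 ms per event.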

TEST 14
--------------------
Fed: Phenix
SII/UC/STAF: STAF
Cache_size: 1 GB
Query: query7
Proc.time: determined by UC,
           estimated 30 sec/event
Policy: yes
Cycle: 10
Comments:
1. This was a "robustness" test for STAF codes.
2. HPSS came down in the middle of the test,
   and a certain file could not be PFTP'd.  The CM
   and QM repeatedly issued the query, until HPSS
   came back up, then continued properly.
3. Test was completed.
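
The recovery behavior seen here, re-issuing the request until HPSS is back,
can be sketched as a simple retry loop.  This is not the CM/QM code: the
stage_file callable, the use of OSError to signal a failed transfer and the
60 sec delay are all assumptions made for the illustration.

```python
import time


def stage_with_retry(stage_file, file_id, retry_delay_s=60.0,
                     max_attempts=None, sleep=time.sleep):
    """Call stage_file(file_id) until it succeeds.

    stage_file is a hypothetical callable that raises OSError while the
    mass storage system is unavailable.  With max_attempts=None the loop
    retries indefinitely.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return stage_file(file_id)
        except OSError:
            if max_attempts is not None and attempt >= max_attempts:
                raise  # give up and report the last failure
            sleep(retry_delay_s)  # wait before re-issuing the request


# Example: a fake stager that fails twice, then succeeds.
calls = []


def flaky_stage(file_id):
    calls.append(file_id)
    if len(calls) < 3:
        raise OSError("storage system unavailable")
    return f"staged:{file_id}"


result = stage_with_retry(flaky_stage, 103, sleep=lambda s: None)
print(result, len(calls))
```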

TEST 15
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 2 GB
Query: Queries11 (3 queries), then each individually.
Proc.time: 1
Policy: yes
Cycle: 1
Comments:
1. This was a combined test to be run overnight
   to test caching coordination effectiveness.
   3 queries were chosen so that they mutually
   overlap by 50% in terms of the files they use
   (each accesses 8 files; each pair has 4 overlapping
    files).
2. First all three were run together.  Then each was
   run individually.

TEST 16
--------------------
Fed: star
SII/UC/STAF: UC
Cache_size: 350 MB
Query: Quer12.txt (1 query, run twice with a delay of 15 min.)
Proc.time: 1
Policy: no
Cycle: 1
Comments:
1. Did not run to completion.
2. The second query started just when the QM hung because CM
   returned No Space left. This is the same problem as in Run2.

TEST 17
--------------------
Fed: star
SII/UC/STAF: UC
Cache_size: 350 MB
Query: Quer12.txt (1 query, run twice with a delay of 15 min.)
Proc.time: 1
Policy: yes
Cycle: 1
Comments:
1. Did not run to completion, but might have run long enough
   to prove the point, i.e., that the policy is useful.
2. The reason it didn't complete was that, when HPSS returned
   Error 16 (Mount device busy) and staging of a file failed,
   the QM stopped requesting files to be cached. It should
   have asked for the next file in its list.
   The local cache was empty in the end.
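
The behavior we would have expected, moving on to the next file when one
staging request fails, can be sketched as follows.  This is not the QM
code: stage_file is a hypothetical callable, and the use of OSError to
report a failed staging is an assumption of the sketch.

```python
from collections import deque


def cache_files(file_list, stage_file):
    """Stage each file in order, skipping files whose staging fails.

    Returns (cached, failed); failed files could be retried in a later pass.
    """
    pending = deque(file_list)
    cached, failed = [], []
    while pending:
        file_id = pending.popleft()
        try:
            stage_file(file_id)
        except OSError:
            failed.append(file_id)  # note the failure, keep going
            continue
        cached.append(file_id)
    return cached, failed


# Example: staging of file 2 fails with a "mount device busy" style error.
def fake_stage(file_id):
    if file_id == 2:
        raise OSError(16, "Mount device busy")


cached, failed = cache_files([1, 2, 3, 4], fake_stage)
print(cached, failed)
```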


TEST 18
--------------------
Fed: star
SII/UC/STAF: UC
Cache_size: 350 MB
Query: Quer12.txt (1 query, run twice with a delay of 10 min.)
Proc.time: 1
Policy: no
Cycle: 1
Comments:
1. This is the same test as the previous one, except that 
   the cache coordination policy was turned off.  Thus, the
   query was run twice with a delay of 10 min, but the file
   request order was the same in both, rather than according to
   what was in the cache.
2. This test ran successfully, showing that essentially all
   files were read twice since there was no cache coordination.
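
The effect of coordination in TEST 17 and TEST 18 can be illustrated with
a toy simulation.  It is a much simplified stand-in for the Query Monitor
policy, not the policy itself: two copies of the same query each process
one file per step, the second starts after a delay, the cache evicts the
oldest file first, and "coordinated" just means a query prefers a file it
still needs that is already in the cache.  All parameter values are made up.

```python
def simulate(n_files, capacity, delay, coordinated):
    """Return the number of tape reads for two identical, staggered queries."""
    cache = []      # file ids, oldest first (FIFO eviction)
    tape_reads = 0
    todo = [list(range(n_files)), list(range(n_files))]  # files left per query
    start = [0, delay]                                   # start step per query
    step = 0
    while todo[0] or todo[1]:
        for q in (0, 1):
            if step < start[q] or not todo[q]:
                continue
            choice = todo[q][0]  # default: next file in index order
            if coordinated:
                in_cache = [f for f in todo[q] if f in cache]
                if in_cache:
                    choice = in_cache[0]  # reuse a file already cached
            if choice not in cache:
                tape_reads += 1           # file must be read from tape
                if len(cache) >= capacity:
                    cache.pop(0)          # evict the oldest file
                cache.append(choice)
            todo[q].remove(choice)
        step += 1
    return tape_reads


no_policy = simulate(n_files=20, capacity=3, delay=5, coordinated=False)
with_policy = simulate(n_files=20, capacity=3, delay=5, coordinated=True)
print(no_policy, with_policy)
```

In this toy setup every file is read twice without coordination, while
with coordination the second query only has to re-read the few files that
were evicted before it started.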