all tests setup - revised
This version contains the additional 3 tests performed on STAR on Oct 5th.
Below is a summary of the tests we ran at BNL. Each test has a summary
of the setup followed by comments about the outcome. The comments
include some preliminary observations that still need to be confirmed
against the logs.
All the logs, queries, and other info collected for each run are in
the directories /grandch/u/GCA/bin/RUN_TEST/results and
/grandch/u/GCA/bin/phenix/results, for star and phenix
respectively.
The setup summary uses the following notation:
Fed: means which federation the test was run on, star or phenix.
SII/UC/STAF: means whether the test was run on the user-code simulator (SII),
the user code (UC), which includes the event iterators and Objectivity,
or an analysis program running on STAF.
Cache size: is the test cache size, selected to reflect a scale-down
in the ratio of the expected real data to the amount of simulation
data available at the time of the test.
Query: specifies which query was run; the query is found in the
corresponding results directory.
Proc.time: is the time we set for processing each event in each test.
It was used to simulate different analysis complexity (e.g. 20-30 sec
for event-by-event).
Policy: yes - we use the current caching policy of the Query Monitor,
which coordinates the use of files among queries.
No - we use the non-coordinating setup of requesting files in the
order extracted from the index.
Cycle: means how many times we ran the query (or collection of
queries) in the test. For example, in TEST 1 below we ran four queries
in a given order and then cycled over them 10 times. A time delay
was introduced between queries according to the specification in the
query file.
TEST 1
--------------------
Fed: star
SII/UC/STAF: UC
Cache_size: 1 GB
Query: queres.all4 (four queries)
Proc.time: 0.01 (except one)
Policy: yes
Cycle: 10
Comments:
1. This run was made overnight as a robustness
test for the entire system, including UC. It ran
for 10 hrs and completed.
2. Note: Yuri ran his jobs during the last 3 hours.
TEST 2
--------------------
Fed: star
SII/UC/STAF: SII
Cache_size: 2 GB
Query: query.q1.twice
Proc.time: 0.01
Policy: no
Cycle: 1
Comments:
1. This test was an attempt to run the same query
twice with a 30 min time delay, without caching
coordination. The purpose was to show that the lack of
coordination causes many files to be read twice.
2. This is when we noticed that the times we were
observing included the wait time until a drive was
available. So we decided to serialize the PFTP
requests in the next test.
TEST 3
--------------------
Fed: star
SII/UC/STAF: SII
Cache_size: 1 GB
Query: query.q1, delay, then query.q1
Proc.time: 0.01
Policy: no
Cycle: 1
Comments:
1. This test was a second attempt to run the same query
twice with a time delay, without caching
coordination.
2. We ran this test with serially scheduled PFTPs.
This is when we noticed that the tape was dismounted and
remounted for every file, even if the next file was on the
same tape. We later learned that this is because the dismount
time on HPSS was set to 15 sec. So serializing PFTPs was
bad: there was never a pending PFTP from the same tape,
and the tape was dismounted unnecessarily.
3. We decided to abandon serialization of PFTPs.
TEST 4
--------------------
Fed: star
SII/UC/STAF: SII
Cache_size: 1 GB
Query: query.cluster
Proc.time: 10
Policy: yes
Cycle: 1
Comments:
1. This test was made to show the benefits of clustering.
We made a query that holds the same number of events
as query1, but in only 4 files. The benefit was a
several-fold speedup, even though the processing time per
event is large (10 sec), which makes processing time
dominate caching time.
2. We noticed that even with one PFTP pending (i.e. one cache-
ahead request) we still got tapes dismounted unnecessarily,
even if the next file was on the same tape. We think this
happens if the transfer time between the HPSS cache and the
local cache is longer than the transfer time of the next
PFTP file into the HPSS cache. We verified that if we have
many (more than 2) PFTPs pending from the same tape, the
tape does not dismount.
TEST 5
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 1 GB
Query: query6 (six queries)
Proc.time: 20 (except one)
Policy: yes
Cycle: 1
Comments:
1. This run was made overnight as a robustness
test. All 6 queries provided by Dave Z.
completed.
TEST 6
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 1 GB
Query: query7
Proc.time: 20 sec
Policy: yes
Cycle: 1
Comments:
1. This run is the first part of a test of the
effect of clustering. This run is not clustered.
2. The hpss_cache did not empty, so this test is not useful
for measurement, but it shows that the system ran OK.
3. Query7 is Query2 modified to have more
events (131 instead of 66).
4. We asked for the purge policy to be changed to:
purge within 5 min when the cache is 1% full (to guarantee
that the hpss_cache empties).
TEST 7
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 1 GB
Query: query7
Proc.time: 20 sec
Policy: yes
Cycle: 1
Comments:
1. This run is the first part of a test of the
effect of clustering. This run is not clustered.
This test is valid.
2. This is the same as the previous run, but
this time the hpss_cache was properly emptied.
3. Query7 is Query2 modified to have more
events (131 instead of 66).
4. Note: This test should be dominated by caching
time, even though proc_time is large, because
there are few relevant events per file.
5. Note: Total time was about 2 hrs (check).
6. Note: We observed tapes 43 and 44 switching
several times, inducing long delays;
e.g. a switch between files 51 and 172, then
again on the next file. Also at 11:31 a switch to 44,
and at 11:33 a switch back to 43.
7. Note: 2 files were missing, 103 and another (see log).
TEST 8
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 1 GB
Query: query8
Proc.time: 20 sec
Policy: yes
Cycle: 1
Comments:
1. This run is the second part of the test of the
effect of clustering. This run is clustered.
This test is valid.
2. The query used here was selected to have the
same number of events as the previous query
(131 events), but they were concentrated in 4 files
(see queries.stat for the distribution).
3. Note: This test should be dominated by processing
time, because there are about 33 events per file
on average, and the processing time is 20 sec/event
(or about 660 sec per file).
4. Note: Total time was about 40 min (check log),
as opposed to 2 hours for the unclustered case.
TEST 9
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 1 GB
Query: query8
Proc.time: 1 sec
Policy: yes
Cycle: 1
Comments:
1. This run is the third part of the test of the
effect of clustering. This run is clustered,
but the processing time was set to 1 second.
This test is valid.
2. As in the previous run,
the query used here was selected to have the
same number of events (131), but they were
concentrated in 4 files
(see queries.stat for the distribution).
3. Note: This test should be dominated by caching
time, because there are about 33 events per file
on average, but the processing time is only 1
sec/event (or about 33 sec per file).
4. Note: Total time was about 15 min (check log),
as opposed to 40 min for the previous run
(with proc. time 20 sec) and as opposed to
2 hours for the unclustered case.
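The per-file timing behind the clustering runs above can be checked
with a quick back-of-the-envelope sketch (the event and file counts
are taken from the notes; the per-file times are approximations, not
measured values):

```python
# Rough check of the clustered runs (TESTS 8 and 9):
# 131 events concentrated in 4 files.
events = 131
clustered_files = 4
events_per_file = events / clustered_files   # ~33 events/file

# Per-file processing time at 20 sec/event (TEST 8, processing-
# dominated) and at 1 sec/event (TEST 9, caching-dominated).
proc_per_file_20 = events_per_file * 20      # ~660 sec/file
proc_per_file_1 = events_per_file * 1        # ~33 sec/file

print(f"{events_per_file:.0f} events/file")
print(f"{proc_per_file_20:.0f} sec/file at 20 sec/event")
print(f"{proc_per_file_1:.0f} sec/file at 1 sec/event")
```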
TEST 10
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 250 MB
Query: query8
Proc.time: 1 sec
Policy: no
Cycle: 1
Comments:
1. This test did not run to completion.
2. This run was supposed to be the first part of
a test to show the value of caching coordination.
It was set up as the same query run twice with a
delay of 10 min. The idea is that by the time the
second query starts, the first file has been removed
from the cache. Since there is no policy, the file
requests will not be synchronized, and the files
will be cached twice.
3. This test got as far as asking to cache the first
file a second time, and then the QM got stuck because
of a race condition that was not anticipated.
4. The second part of this test was not conducted.
Alex was contacted, and he sent a fix. The plan is
to run this test again on Oct 5th (tomorrow),
when we switch over to STAR.
TEST 11
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 2 GB
Query: query10
Proc.time: 0.01 sec
Policy: yes
Cycle: 1
Comments:
1. The purpose of this test was to check the
efficiency of accessing Objectivity. It was
set up in 3 steps:
a) Make the cache size large enough to hold
10 files, and cache them.
The selected files and sizes were (sizes provided
by Dave Z. in parentheses):
98(194), 94(194), 71(182), 57(173), 39(156)
73(128), 72(148), 58(155), 47(102), 41(119)
In total the query selected 500 events out of
these files.
b) Run a test with SII only, running this query
10 times, where the processing time is very small.
No caching takes place since all the files were
left in the cache. Thus 5000 events are "retrieved".
c) Run the same test with UC. This makes Objectivity
get 5000 events from 10 files.
2. This run is only the cache-loading setup.
It ran successfully.
TEST 12
--------------------
Fed: Phenix
SII/UC/STAF: SII
Cache_size: 2 GB
Query: query10
Proc.time: 0.01 sec
Policy: yes
Cycle: 10
Comments:
1. This is step b) of the test, running SII -
see the previous run.
2. Processing time was 32 sec (as expected),
since all the files were in cache.
TEST 13
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 2 GB
Query: query10
Proc.time: 0.01 sec
Policy: yes
Cycle: 10
Comments:
1. This is step c) of the test, with UC (i.e.
Objectivity) - see the previous runs.
2. Processing time was 130 sec,
since all the files were in cache.
3. This is only about 100 sec to access 5000 events
(pretty good!).
4. A more interesting test would be with UC running
on a Linux machine over the net. Then we would test
the cost of transfer time over the net.
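The per-event cost implied by comparing the SII and UC runs can be
sketched as follows (the 32 sec and 130 sec figures are the run times
quoted above; the per-event number is a rough derived estimate):

```python
# Rough per-event cost of Objectivity access, from TESTS 12 and 13.
# 10 cycles x 500 events = 5000 events; all files were in cache,
# so the time difference is attributable to Objectivity access.
events_retrieved = 10 * 500   # 5000 events
sii_time = 32                 # sec, TEST 12 (SII only, no Objectivity)
uc_time = 130                 # sec, TEST 13 (UC, i.e. Objectivity)

objy_overhead = uc_time - sii_time                    # ~100 sec
per_event_ms = objy_overhead / events_retrieved * 1000

print(f"Objectivity overhead: {objy_overhead} sec "
      f"(~{per_event_ms:.0f} ms per event)")
```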
TEST 14
--------------------
Fed: Phenix
SII/UC/STAF: STAF
Cache_size: 1 GB
Query: query7
Proc.time: determined by UC,
estimated 30 sec/event
Policy: yes
Cycle: 10
Comments:
1. This was a "robustness" test for the STAF codes.
2. HPSS came down in the middle of the test,
and a certain file could not be PFTP'd. The CM
and QM repeatedly reissued the query until HPSS
came back up, then continued properly.
3. The test was completed.
TEST 15
--------------------
Fed: Phenix
SII/UC/STAF: UC
Cache_size: 2 GB
Query: Queries11 (3 queries), then each individually.
Proc.time: 1
Policy: yes
Cycle: 1
Comments:
1. This was a combined test, run overnight,
of caching-coordination effectiveness.
3 queries were chosen so that they mutually
overlap by 50% in terms of the files they use
(each accesses 8 files; each pair has 4 overlapping
files).
2. First all three were run together. Then each was
run individually.
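To illustrate what this overlap means for caching coordination, here
is a small sketch with hypothetical file IDs chosen only to satisfy
the stated constraints (8 files per query, 4 shared between each
pair); the actual file sets are in the results directory:

```python
# Hypothetical file sets satisfying the TEST 15 constraints:
# each query touches 8 files, and each pair of queries shares 4.
q1 = {1, 2, 3, 4, 5, 6, 7, 8}
q2 = {5, 6, 7, 8, 9, 10, 11, 12}
q3 = {1, 2, 3, 4, 9, 10, 11, 12}

assert all(len(q) == 8 for q in (q1, q2, q3))
assert all(len(a & b) == 4 for a, b in ((q1, q2), (q1, q3), (q2, q3)))

# Without coordination, each query stages its own files.
uncoordinated = len(q1) + len(q2) + len(q3)

# With coordination, a file shared among queries ideally needs
# to be staged from tape only once.
coordinated = len(q1 | q2 | q3)

print(uncoordinated, coordinated)   # 24 12
```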
TEST 16
--------------------
Fed: star
SII/UC/STAF: UC
Cache_size: 350 MB
Query: Quer12.txt (1 query, run twice with a delay of 15 min.)
Proc.time: 1
Policy: no
Cycle: 1
Comments:
1. This test did not run to completion.
2. The second query started just when the QM hung because the CM
returned "No space left". This is the same problem as in Run2.
TEST 17
--------------------
Fed: star
SII/UC/STAF: UC
Cache_size: 350 MB
Query: Quer12.txt (1 query, run twice with a delay of 15 min.)
Proc.time: 1
Policy: yes
Cycle: 1
Comments:
1. This test did not run to completion, but it might have run long
enough to prove the point, i.e., that the policy is useful.
2. The reason it didn't complete was that, when HPSS returned
Error 16 (Mount device busy) and the staging of a file failed,
the QM stopped requesting files to be cached. It should
have asked for the next file in its list.
The local cache was empty in the end.
TEST 18
--------------------
Fed: star
SII/UC/STAF: UC
Cache_size: 350 MB
Query: Quer12.txt (1 query, run twice with a delay of 10 min.)
Proc.time: 1
Policy: no
Cycle: 1
Comments:
1. This is the same test as the previous one, except that
the cache-coordination policy was turned off. Thus, the
query was run twice with a delay of 10 min, but the file
request order was the same in both runs, rather than
according to what was in the cache.
2. This test ran successfully, showing that essentially all
files were read twice, since there was no cache coordination.