
Re: MDC1-GC: lessons learned

-- BEGIN included message

Arie Shoshani wrote:
> 
> 2.  Long dismount/mount times
> 
> Observation: it takes 2 min on average to dismount and mount a tape.
> Reason: it takes 90 sec to rewind a full tape and 90 sec to seek to the
> end of a tape.  Mount/dismount itself is about 17 sec.  We observed
> longer times.
> Lesson: this confirms that it is important to schedule multiple reads
> out of the same tape when possible.
> Implication: we need to find a way to get tape IDs dynamically and use
> that in the caching policy.  Currently this means having a client API to
> HPSS.

There is another method we can use. David Z. got the information from
Oak Ridge. Luis knows more about it, I think, but there is a command
that tells you which files are on each tape, and in what order. This
could be very useful to us.
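
To make the scheduling idea concrete, here is a rough Python sketch of
how the caching policy could batch requests per tape once we have that
listing. The locate() callable is hypothetical (it would wrap the
tape-listing command), not a real HPSS API:

from collections import defaultdict

def schedule_by_tape(requests, locate):
    # requests: file names queued for staging
    # locate:   hypothetical callable returning (tape_id, position)
    #           for a file, backed by the tape-listing command above
    by_tape = defaultdict(list)
    for f in requests:
        tape_id, pos = locate(f)
        by_tape[tape_id].append((pos, f))
    schedule = []
    for tape_id, files in by_tape.items():
        # one mount per tape; read files in on-tape order so we never
        # seek backwards within a mount
        for pos, f in sorted(files):
            schedule.append((tape_id, f))
    return schedule

With ~2 min lost per dismount/mount, reading n files from one mount
instead of n separate mounts saves roughly 2*(n-1) minutes.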
 
> 3. Transfer rate between caches
> 
> Observation: most of the time we got about 2 MB/s.  Sometimes we got as
> much as 5 or 6 MB/s, but we often observed 0.5 to 1 MB/s.
> Reason: network is shared.

Are you sure it's not a configuration problem? I don't think we got
rates of 20 kB/s just because the network was shared.

> Lesson: the transfer rate between HPSS cache and local cache can
> dominate.  Even if we avoid dismounts, and transfer from tape to
> hpss_cache is fast, the effective transfer rate is determined by the
> transfer rate between the caches.
> Implication: caching ahead into the local cache can be beneficial.  But
> if processing time per file is long, it should be limited to 1.
> 
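
On the look-ahead depth: with a serial cache-to-cache link, one file of
look-ahead already overlaps the transfer of the next file with the
processing of the current one, so a sketch of the rule above could be
(the names and the smoothing depth are assumptions, not measured
values):

def prefetch_depth(transfer_time_s, process_time_s, smoothing_depth=3):
    # if processing a file takes at least as long as transferring one,
    # a single file of look-ahead hides the transfer completely
    if process_time_s >= transfer_time_s:
        return 1
    # otherwise the link is the bottleneck anyway; a little extra
    # depth only buffers the 0.5-6 MB/s rate swings quoted above
    return smoothing_depth

E.g. a 1 GB file at 2 MB/s takes ~500 sec to move between the caches;
with 600 sec of processing per file, depth 1 is enough, which matches
the limit-to-1 rule.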
> 4.  HPSS misbehavior
> 
> Observation: we got several unexpected errors from HPSS, such as "can't
> mount tape" (error 17), "path name too long", etc.  We also had HPSS

"device busy" is error 16. We also got error 5 (I/O Error) several
times.

> malfunctions and fixes during test runs.
> Lesson: we need to be prepared for all such behavior.  In one test we
> resumed operation automatically after an HPSS malfunction by
> periodically re-issuing the transfer request for the last file.
> Implication: we need to decide what is appropriate for the Storage
> Manager to do if we get errors from HPSS and Objectivity.
> 
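
Something like the following sketch would implement that kind of
recovery. Here request_transfer stands in for whatever client call we
end up with; it is not a real HPSS API:

import time

def resume_after_outage(request_transfer, last_file, poll_s=60):
    # keep re-issuing the last outstanding request until HPSS answers;
    # request_transfer is assumed to raise OSError on failure
    while True:
        try:
            request_transfer(last_file)
            return  # success: normal operation can resume
        except OSError as err:
            # e.g. errno 5 (I/O error) or 16 (device busy) from above
            print("HPSS error %s on %s; retrying" % (err.errno, last_file))
            time.sleep(poll_s)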
> 8.  Objectivity overhead appears low
> 
> Observation: the observed overhead for getting 5000 events from
> Objectivity was only 100 sec.  However, the test was made with "minimal
> user code", and the user code was on the same machine as Objectivity.
> Lesson: we should plan tests where the user code runs on a Linux PC
> over a network to see the effect.

Don't forget that there were only 500 unique events, so Objy may have
taken advantage of caching.
I do not think we should go out and say (especially not to Objy) that
Objy adds no overhead. For very small queries it seems that it doesn't,
but we need to do stress testing before we can speak about the general
case. I just don't want people to quote us on this one and get the
wrong idea.
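
For scale: 5000 events in 100 sec is 20 ms per event, but with each
event read ten times on average. A stress test should use distinct
events and time them directly; a trivial harness sketch, where
fetch_event is a stand-in for the client read call, not Objy's API:

import time

def time_event_reads(fetch_event, event_ids):
    # pass distinct event_ids to defeat any database-level caching
    start = time.perf_counter()
    for eid in event_ids:
        fetch_event(eid)
    elapsed = time.perf_counter() - start
    return elapsed / len(event_ids)  # mean seconds per event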

 - Henrik 
_________________________________________
Henrik Nordberg       <hnordberg@lbl.gov>
Scientific Data Management Research Group
Lawrence Berkeley National Laboratory

-- END included message