
Submission Problems

In this section I report the problems I met while submitting production jobs. Given the complexity of the grid and the amount of network transfers involved, failures may occur at any point of the processing chain. Grid computing is also highly dynamic: temporary files and work areas must be promptly cleaned up to free space for newly submitted jobs, so debugging is a complex task, as the relevant information may be deleted while one is still trying to diagnose the problem.

Production jobs are usually resource bound. In my case each job required a very long CPU time ($>15$ hours), more than 400 MB of memory, nearly 2 GB of disk space and access to MSS storage. Only very few CEs offered such resources, which limited the choice of sites, with the net result that jobs were usually queued on overloaded CEs. Since 20-30 jobs represented the average daily submission, the available RBs were also overloaded.

The RB checks that there is enough space left for the Input/Output Sandbox files. However, if the file space on the WN is exhausted there is no booking tool to reserve space in advance; the granularity of file-space allocation and space handling in general remain open problems.

When a job is submitted to the Grid, the input files listed in the JDL InputSandbox attribute are migrated from the UI to the WN through the RB using globus-url-copy. If this operation fails, the job submission aborts with the message:
Failure while executing job wrapper.
The problem is usually due to resource saturation on the RB (sandbox space exhausted).
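
For reference, a minimal JDL of the kind used for these jobs looks as follows. This is only a sketch: the output file names are hypothetical, and only the script name is taken from the logs quoted later in this section.

```
Executable    = "script_mss.sh";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"script_mss.sh"};
OutputSandbox = {"std.out", "std.err"};
```

Everything listed in InputSandbox is copied through the RB at submission time, which is why sandbox size matters.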

As the RB temporary space is shared among jobs and is of limited size, do not direct big files to the sandbox areas. The sandboxes are designed to hold submission scripts and log and error files of limited size; bigger I/O files should be stored on an SE. Since at present I/O is performed on the local disk of the WN, temporary space may also be a problem for big output files.
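Since there is no server-side space reservation, a simple client-side pre-check can catch oversized sandboxes before submission. The sketch below is purely illustrative: the 10 MB quota and the function names are my own assumptions, not EDG parameters.

```python
import os

# Assumed per-job sandbox quota; the real RB limit is site-dependent.
SANDBOX_QUOTA_BYTES = 10 * 1024 * 1024  # 10 MB, illustrative value

def sandbox_size(paths):
    """Total size in bytes of the files listed in a JDL InputSandbox."""
    return sum(os.path.getsize(p) for p in paths)

def check_sandbox(paths, quota=SANDBOX_QUOTA_BYTES):
    """Return (ok, total): ok is False if the sandbox exceeds the quota."""
    total = sandbox_size(paths)
    return total <= quota, total
```

Running such a check on the UI before dg-job-submit would have avoided some of the aborts described below.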

Common problems can often be diagnosed with dg-job-get-logging-info, which reports the logging and bookkeeping history of a job.

Possible explanations of the job wrapper error are given below.

The user job is wrapped in a script which also performs the transfer of
the input and output sandboxes. The standard output of this script is
transferred via Globus GASS from the CE node to the RB machine in order
to check whether the job was successfully executed.

Many reasons for this problem were found and addressed, e.g.:

- Globus GASS cache problems (fixed via a patch provided by the Condor
  team)

- Exhausted resources on the CE head node: sysadmins must take care to
  properly set some sysctl parameters

- Race conditions for file updates between the worker node and the
  CE node: fixed via a workaround implemented in the jobmanager by WP1

- PBS does not report terminated jobs through the "qstat" interface,
  and Globus considers a job not found by "qstat" as "done": addressed
  via a new Globus PBS script provided by WP1

Besides the above-mentioned fixes and workarounds, a modification (the
so-called JSS-Maradona) was introduced in the WMS software to reduce the
failure rate: the standard output of the job wrapper is also transferred
via GridFTP.

The error message therefore means that, in spite of all these fixes and
workarounds, the standard output of the job wrapper was not available.
There can be various reasons; for example, the job may have been
dispatched to a WN where the home directory was not accessible.

In case of gatekeeper problems, the job fails with a generic Globus Failure message:

Status                  =    Aborted
Status Reason           =    Globus Failure:

If an error occurs while storing a file in the RC, a message of the following type may be returned:

configuration file:  alice_rc_hbt_production_mss.conf
 ReplicaManager Error: Unknown error: File
ccgridli07.in2p3.fr//edg/StorageElement/prod/alice/run_3.09.06_00809.log
already exists (Error code:200019) in function copyAndRegisterFile
Unknown error. Exiting...
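
When automating resubmission, it helps to extract the numeric error code from such messages mechanically. The sketch below is a minimal parser: it assumes only the "Error code:NNNNNN" pattern visible above, and the function name is mine, not part of any EDG tool.

```python
import re

# Matches the "(Error code:200019)" pattern seen in ReplicaManager output.
_ERR_RE = re.compile(r"Error code:(\d+)")

def replica_error_code(message):
    """Return the numeric error code embedded in a ReplicaManager
    error message, or None if no code is present."""
    m = _ERR_RE.search(message)
    return int(m.group(1)) if m else None
```

A resubmission script could then treat "file already exists" codes differently from genuine transfer failures.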

Other collected error messages are:

Mon May 12 09:36:44 CEST 2003
The job number is: 00859

Connecting to host gm03.hep.ph.ic.ac.uk, port 7771
Failed to establish security context (init):
    Some Other GSS failure
    globus_gss_assist token :3: read failure: Connection closed
    GSS status: major:01090000 minor: 00000000 token: 00000003
**** Error: RB_API_ERR ****
"Connection error!" returned from "get_multiattribute_list" api
This error may arise from an expired certificate, or when the certificate validity is renewed while jobs are still in some state in the Grid execution list.
**** Warning: RB_CONNECTION_FAILURE ****
Unable to connect to RB "grid004.ca.infn.it"
**** Error: UI_NO_RB_CONTACT ****
Unable to contact any broker supplied
This error is due to certificate problems, e.g. a missing grid-mapfile entry.
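
Since the same few messages recur, a first-level triage can be done by pattern matching. The sketch below keys only on the fragments quoted above; it is my own helper, not an EDG utility, and it cannot replace dg-job-get-logging-info, only suggest where to look first.

```python
def classify_submission_error(text):
    """Rough triage of UI/RB error output, keyed on the message
    fragments collected in this section."""
    if "Failed to establish security context" in text:
        return "certificate problem (expired proxy or missing grid-mapfile entry)"
    if "Unable to connect to RB" in text or "Unable to contact any broker" in text:
        return "RB unreachable (network problem or broker down)"
    if "Failure while executing job wrapper" in text:
        return "RB resource saturation (sandbox space exhausted)"
    return "unclassified: inspect dg-job-get-logging-info output"
```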

Not all error messages reach the end user. If the InputSandbox file transfer fails, the information is registered in an RB error area that is available only to site managers. In one case, in which the job aborted just after submission, the RB error log reported the following message:

===================================================================
JSSparser.log.old:Last job terminated (416) aborted due to:
   Cannot download script_mss.sh from
gsiftp://grid004.ca.infn.it/tmp/https:__grid004.ca.infn.it:
       7846_131.154.99.139_08523874748463_grid004.ca.infn.it:7771/input/
JSSparser.log.old:Last job terminated (417) aborted due to:
   Cannot download script_mss.sh from
gsiftp://grid004.ca.infn.it/tmp/https:__grid004.ca.infn.it:
      7846_131.154.99.139_08525975325054_grid004.ca.infn.it:7771/input/
====================================================================

The above messages might be the result of a network timeout or of the RB space being full. The following error message is most probably due to a similar problem:

dg_JobId              = https://grid004.ca.infn.it:.............
Status                = Done
Last Update Time (UTC)= Thu May 29 11:48:33 2003
Job Destination       = testbed008.cnaf.infn.it:2119/jobmanager-pbs-long
Status Reason         = the job manager failed to open stderr, giving up
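
Site managers scanning RB logs of the JSSparser kind shown above can pull out the aborted job numbers mechanically. A minimal sketch, assuming only the "terminated (NNN) aborted due to" layout of those log lines:

```python
import re

# Matches "... terminated (416) aborted due to:" in JSSparser-style logs.
_ABORT_RE = re.compile(r"terminated \((\d+)\) aborted due to:")

def aborted_jobs(log_text):
    """Return the job numbers reported as aborted in an RB log."""
    return [int(n) for n in _ABORT_RE.findall(log_text)]
```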

Disk space usage and its sharing among VOs may also cause problems (quotas, ownership, etc.).

Data management facilities also need improvement: wildcard equivalents, ACLs, and tools for moving files from one SE to another and from disk to MSS.


luvisetto 2003-07-25