In this section I report the problems met during submission of production jobs. Given the complexity of the Grid and the amount of network transfers involved, failures may occur at any point of the processing chain. Since Grid computing is highly dynamic, with a large number of temporary files and areas that must be promptly cleaned up to free space for newly submitted jobs, debugging is a complex task: the relevant information may be deleted while one is still trying to diagnose the problem.
Production jobs are usually resource bound. In my case the jobs required a very long CPU time ( hours), more than 400 MB of memory, nearly 2 GB of disk space and MSS storage access. Only very few CEs offered such resources, which limited site access, with the net result that jobs were usually queued on overloaded CEs. Taking also into account that the average daily submission was at least 20-30 jobs, the available RBs were overloaded as well.
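Jobs with such requirements can be steered away from unsuitable sites at match-making time through the JDL Requirements expression. A sketch along these lines (the attribute names are taken from the EDG/GLUE schema; the thresholds are illustrative examples, not the exact production values):

```
// Illustrative JDL fragment: accept only CEs that publish enough
// CPU time and memory for a long production job, and prefer the
// least loaded one.  Thresholds here are assumptions.
Requirements = other.GlueCEPolicyMaxCPUTime > 2880
            && other.GlueHostMainMemoryRAMSize >= 400;
Rank = -other.GlueCEStateEstimatedResponseTime;
```

Running edg-job-list-match on the JDL before submission shows whether any CE matches at all, which helps distinguish "no resources" from genuine submission failures.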
The RB checks that there is enough space left for the Input/Output Sandbox files. If space is exhausted on the WN, there is no booking tool to reserve space in advance. File space granularity and space handling remain open problems.
When a job is submitted to the Grid, the files listed in the JDL InputSandbox attribute are transferred from the UI to the WN through the RB using globus-url-copy. If this operation fails, the job submission aborts with a job wrapper message:
Failure while executing job wrapper.
The problem is usually due to RB resource saturation (sandbox space exhausted).
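Transfer failures of this kind are often transient and succeed on a later attempt. A minimal retry sketch (the copy command itself is abstracted behind a callable, since the actual globus-url-copy invocation depends on the local installation):

```python
import time

def transfer_with_retry(copy_fn, retries=3, delay_s=30):
    """Retry a sandbox transfer a few times before giving up.

    copy_fn stands in for the real transfer (e.g. an invocation of
    globus-url-copy) and must return True on success, False on failure.
    """
    for attempt in range(1, retries + 1):
        if copy_fn():
            return True
        if attempt < retries:
            time.sleep(delay_s)  # back off before the next attempt
    return False
```

The retry count and delay are arbitrary choices here; a persistent failure after all attempts still points to RB saturation rather than a network glitch.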
Since the RB temporary space is shared among jobs and is of limited size, do not direct big files to the sandbox areas. The sandboxes are designed to hold submission scripts and log and error files of limited size; bigger I/O files should be stored on a SE. Since at present I/O is performed on the local disk space of the WN, temporary space may also be a problem for big output files.
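This rule of thumb can be encoded as a simple size check in the submission scripts. A sketch (the 10 MB threshold is an assumption for illustration; the middleware does not mandate a specific limit):

```python
# Sandbox-vs-SE decision sketch: small scripts and logs travel in the
# sandbox, big data files are stored on a Storage Element instead.
SANDBOX_LIMIT_BYTES = 10 * 1024 * 1024  # assumed limit, site-dependent

def destination(size_bytes):
    """Return where a file of the given size should go."""
    return "sandbox" if size_bytes <= SANDBOX_LIMIT_BYTES else "SE"
```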
Common problems are:
error: the server sent an error response: 425 425 Can't open data connection. timed out() failed. This is usually due to an error detected by
Segmentation Violation: this may happen when a C program opens a missing input file, e.g. when the InputSandbox has not yet been transferred to the WN.
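A cheap defence is to verify in the job script that all expected input files are present on the WN before launching the payload, so the job fails with an explicit message instead of a segmentation violation. A sketch (the file names a real script would check are job-specific):

```python
import os

def check_inputs(paths):
    """Return the list of expected input files missing on the WN.

    Calling this at the top of the job script lets the job abort with a
    clear error message instead of letting a C payload crash on a NULL
    file handle when the InputSandbox never arrived.
    """
    return [p for p in paths if not os.path.isfile(p)]
```

The job script would call check_inputs on its expected files and exit with an explicit error if the returned list is non-empty.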
More information on job submission is given by
Possible explanations of the job wrapper error are given below:
The job wrapper is the script around which the user job is wrapped (it also performs the transfer of the input and output sandboxes); its standard output is transferred via Globus GASS from the CE node to the RB machine in order to check whether the job was successfully executed. Many reasons for this failure were found and addressed, e.g.:
- Globus GASS cache problems (fixed via a patch provided by the Condor team);
- exhausted resources on the CE head node: sysadmins must take care to properly set some sysctl parameters;
- race conditions for file updates between the worker node and the CE node: fixed via a workaround implemented in the jobmanager by WP1;
- PBS does not remember terminated jobs through the "qstat" interface, and Globus considers a job not found by "qstat" as a "done" job: addressed via a new Globus PBS script provided by WP1.
Besides the above fixes/workarounds, in order to reduce the failure rate a modification was introduced in the WMS software (the so-called JSS-Maradona) to transfer the standard output of the job wrapper also via GridFTP. The error message therefore means that the standard output of the job wrapper, in spite of all these fixes/workarounds, was not available. There can be various reasons; for example, the job was dispatched to a WN where the home directory was not accessible.
In case of gatekeeper problems, the job fails with a generic Globus Failure message:
Status = Aborted Status Reason = Globus Failure:
In case of errors while storing a file on the RC, you may get a message of the following type:
configuration file: alice_rc_hbt_production_mss.conf ReplicaManager Error: Unknown error: File ccgridli07.in2p3.fr//edg/StorageElement/prod/alice/run_3.09.06_00809.log already exists (Error code:200019) in function copyAndRegisterFile Unknown error. Exiting...
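Such "file already exists" collisions typically happen when a resubmitted job tries to register its output under the same name an earlier, partially failed attempt already registered. One defensive convention (the naming scheme below is purely illustrative) is to make the attempt number part of the file name:

```python
def output_name(prefix, run, attempt):
    """Build a per-attempt output file name for registration, so a
    resubmission never collides with a file registered by a previous,
    partially failed attempt of the same run."""
    return "%s_%05d_a%d.log" % (prefix, run, attempt)
```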
Other collected error messages are:
Mon May 12 09:36:44 CEST 2003 The job number is: 00859 Connecting to host gm03.hep.ph.ic.ac.uk, port 7771 Failed to establish security context (init): Some Other GSS failure globus_gss_assist token :3: read failure: Connection closed GSS status: major:01090000 minor: 00000000 token: 00000003 **** Error: RB_API_ERR **** "Connection error!" returned from "get_multiattribute_list" api. The error may arise from an expired certificate, or from trying to renew an expired certificate while jobs are still in some state in the Grid execution list.
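The expired-certificate case can be prevented by checking, before submission, that the proxy will outlive the job. The remaining lifetime in seconds is reported by `grid-proxy-info -timeleft`; a sketch of the comparison (the one-hour safety margin is an arbitrary choice):

```python
def proxy_sufficient(timeleft_s, expected_runtime_s, margin_s=3600):
    """True if the proxy (lifetime in seconds, as reported by
    grid-proxy-info -timeleft) will outlive the job, including a
    margin for the time spent queued on an overloaded CE."""
    return timeleft_s > expected_runtime_s + margin_s
```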
**** Warning: RB_CONNECTION_FAILURE **** Unable to connect to RB "grid004.ca.infn.it" **** Error: UI_NO_RB_CONTACT **** Unable to contact any broker supplied. This error is due to certificate problems, e.g. missing grid-map-files, etc.
Not all error messages reach the end user. If the InputSandbox file transfer fails, the information is registered in the RB error log, which is available only to site managers. In one case, in which the job aborted just after submission, the RB error log reported the following message:
=================================================================== JSSparser.log.old:Last job terminated (416) aborted due to: Cannot download script_mss.sh from gsiftp://grid004.ca.infn.it/tmp/https:__grid004.ca.infn.it: 7846_188.8.131.52_08523874748463_grid004.ca.infn.it:7771/input/ JSSparser.log.old:Last job terminated (417) aborted due to: Cannot download script_mss.sh from gsiftp://grid004.ca.infn.it/tmp/https:__grid004.ca.infn.it: 7846_184.108.40.206_08525975325054_grid004.ca.infn.it:7771/input/ ====================================================================
The above message might be the result of a network timeout or of the RB space being full. Most probably the following error message is due to a similar problem.
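On the RB side, a periodic check of the free space in the sandbox area would catch this condition before jobs start failing. A sketch using os.statvfs (the sandbox path is installation-dependent and is passed in as an argument here):

```python
import os

def free_space_mb(path):
    """Free space, in MB, on the filesystem holding `path` --
    e.g. the RB sandbox area, whose location is site-dependent."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize // (1024 * 1024)
```

A site manager could run this from a cron job and raise an alarm when the value drops below the size of a few typical sandboxes.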
dg_JobId = https://grid004.ca.infn.it:............. Status = Done Last Update Time (UTC)= Thu May 29 11:48:33 2003 Job Destination = testbed008.cnaf.infn.it:2119/jobmanager-pbs-long Status Reason = the job manager failed to open stderr, giving up
Disk space usage and sharing among VOs may also cause problems (quotas, ownership, etc.).
Other open issues concern data management facilities: wild-card equivalents, ACLs, moving files from one SE to another and from disk to MSS, etc.