When a job is ready, i.e. the executing script and the job description file are debugged and working, the user activates a valid proxy of the personal certificate and submits the job to the grid.
The job consists of the JDL file, the job scripts and any other input files declared in the input sandbox. I will call this set of files a job-set. The UI copies the job-set to the selected RB, either the default one or the one specified in the submission command.
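As an illustration, a minimal JDL file might look as follows. All file names here are hypothetical, and the exact attribute set depends on the JDL version in use:

```
Executable    = "myscript.sh";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"myscript.sh", "input.dat"};
OutputSandbox = {"std.out", "std.err", "result.dat"};
```

The job-set is then submitted from the UI with a command of the edg-job-submit family, e.g. `edg-job-submit myjob.jdl`; the command accepts options to direct the job to a specific broker or resource, with flags that depend on the middleware release.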
The RB processes the user requirements and matches them against the advertised capabilities of the currently published CEs. On a successful match, the job-set is transferred to the selected CE, which in turn sends the job-set to the best matching WN for execution.
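The matchmaking is driven by the Requirements and Rank attributes the user may add to the JDL file: Requirements is a boolean expression a CE must satisfy, while Rank orders the matching CEs. A sketch follows; the attribute names are taken from the GLUE information schema and may differ in earlier releases:

```
Requirements = other.GlueCEPolicyMaxCPUTime >= 720;
Rank         = other.GlueCEStateFreeCPUs;
```

With this example the RB discards CEs whose queues allow less than 720 minutes of CPU time and, among the survivors, prefers the CE advertising the most free CPUs.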
The CE must support a local batch facility that allows queueing of jobs, possibly with load-balancing capabilities. Since Unix does not provide a native advanced batch system, each site may choose one of the many open-source or proprietary queue managers, such as Condor, PBS or LSF. At present EDG supports PBS as the default batch system.
To process grid jobs, the CE maps the certificate of the job owner to a local virtual user belonging to the user's VO group. Each CE supports a number of such virtual users large enough to allow efficient usage of the computing nodes. Each WN is described to the queue manager in terms of CPUs, usually 2 for dual-CPU nodes or 4 for motherboards with multi-threaded CPUs. The CPU parameter tells the queue manager how many simultaneous jobs may run on each WN. Queue setup is a matter of local policy, which may define several queue types with different priorities depending on job needs, e.g. CPU time, memory, disk space, MSS (mass storage) access, etc.
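In practice, the certificate-to-virtual-user mapping is typically expressed in a grid-mapfile, and the per-WN CPU count in the PBS nodes file. Both fragments below are illustrative sketches: the DN, pool-account prefix and host names are made up.

```
# /etc/grid-security/grid-mapfile:
# map a certificate DN to a pool of virtual users of the VO group
"/O=Grid/O=SomeCA/CN=Some User" .atlas

# PBS nodes file: each WN declares how many simultaneous jobs it accepts
wn001.example.org np=2
wn002.example.org np=4
```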
When the CE receives a job from the Grid, the job is queued, its status is monitored and published to the RB, and execution starts on a WN as a local job. All files are stored locally, so the WN must have enough disk space to hold the data of all the jobs it can run at the same time.
When the job terminates, the CE queue manager informs the RB. All output sandbox files are transferred from the WN to the RB, where they are ready for collection by the UI. The user is informed of job termination by the job status handler, and by e-mail if the notify flag was set at submission. When the user retrieves the output files, the RB copies them to the UI and deletes its local copy. If an interruption occurs during this phase, the output sandbox files are lost.
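On the UI, the job status can be polled and the output sandbox collected with commands along these lines; the job identifier and target directory are placeholders, and exact command names depend on the EDG release:

```
edg-job-status   https://rb.example.org:9000/A1b2C3
edg-job-get-output --dir ./job-output https://rb.example.org:9000/A1b2C3
```

After a successful edg-job-get-output, the sandbox files reside on the UI and the RB copy is gone, so the retrieval should be repeated only if it failed.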
Output files that must survive beyond the job's execution may be stored in a permanent archive area. Storage server and space keeper functions in the grid are performed by SE nodes, as described in Section 7.4.