
Remarks on Job Queue Selection

When a job is submitted with default options, the RB checks the JDL parameters and selects an appropriate queue on a free CE (Computing Element). The selection should perform a best match, so that the least loaded WN (Worker Node) is chosen for the job, giving the shortest possible elapsed time for job execution.

In my experience, I have been able neither to verify this best-match feature nor to understand how the job scheduling process works. I will try to show what happens with a real-life example.
The submitted job is the following very simple script with a minimal JDL.


simple.sh
#!/bin/sh
#
echo -n "Start date is: "
date
echo "First JDL test"
echo -n "on host: "
hostname
echo -n "End date is: "
date
simple.jdl
Executable = "simple.sh";
StdOutput = "simple.out";
StdError = "simple.err";
InputSandbox = {"simple.sh"};
OutputSandbox = {"simple.out", "simple.err"};


I use dg-job-list-match to find CE candidates at the selected RB:

$ dg-job-list-match -c MyConfig.cfg simple.jdl

I have always observed that the job is submitted to the first CE in the match list, and this frequently causes problems with overloaded CEs, as shown in the following examples:

If the job fails, the user must choose another CE and resubmit with the --resource ce_id option, hoping to pick an available CE that is not overloaded. This bypasses one of the basic Grid features, namely automatic resource matching.
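The manual retry described above can be sketched as a small shell loop. This is only an illustration under stated assumptions: it assumes that CE identifiers in the list-match output look like host:port/jobmanager-type-queue, and that the submission command is dg-job-submit and accepts the same -c option as dg-job-list-match. The function names extract_ces and submit_first_ok and the SUBMIT variable are hypothetical helpers, not part of the Grid middleware.

```shell
# Submission command; overridable so the loop can be tried without middleware.
# Assumption: dg-job-submit accepts --resource ce_id and -c config_file.
SUBMIT=${SUBMIT:-dg-job-submit}

# Read list-match output on stdin and print one CE id per line.
# Assumption: CE ids have the form host:port/jobmanager-type-queue.
extract_ces() {
    grep -Eo '[A-Za-z0-9.-]+:[0-9]+/jobmanager-[A-Za-z0-9_.-]+'
}

# Read CE ids on stdin and submit to each in turn.
# Stops at the first CE for which the submission command succeeds,
# prints that CE id on stdout and returns 0; returns 1 if none succeeds.
# Usage: submit_first_ok config.cfg job.jdl < ce_list
submit_first_ok() {
    cfg=$1
    jdl=$2
    while read -r ce; do
        echo "Trying $ce" >&2
        # Keep the submitter away from the CE list on stdin,
        # and send its own messages to stderr.
        if "$SUBMIT" --resource "$ce" -c "$cfg" "$jdl" </dev/null >&2; then
            echo "$ce"
            return 0
        fi
    done
    return 1
}

# Possible use (not run here):
#   dg-job-list-match -c MyConfig.cfg simple.jdl | extract_ces |
#       submit_first_ok MyConfig.cfg simple.jdl
```

Note that an accepted submission only means the request was taken in charge. A job that then stays in Scheduled status on an overloaded CE still has to be cancelled and resubmitted by hand.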

As my test job is very simple and the short queue should handle fast jobs, the expected elapsed time is in the range of minutes. A persistent Scheduled status therefore probably reflects some Grid problem.

In this example, the list-match process lists many other CEs, and at least one of them is under my control, as it belongs to the local grid setup where I share the site administrator task. This CE is free and has no jobs in its PBS queue, as I can check by logging into the CE as administrator. I therefore cancel the job and resubmit it, forcing the selection of my own CE. The job runs in minutes, producing the expected output:

$ cat /tmp/071033176036971/simple.out
Start date is: Wed Apr 30 09:09:54 CEST 2003
First JDL test
on host:  boalice4.bo.infn.it
End date is:   Wed Apr 30 09:09:54 CEST 2003
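
The check that a CE has no jobs in its PBS queue, mentioned above for the local CE, can be scripted on the CE itself. The following is a minimal sketch, assuming that the PBS qstat command prints two header lines followed by one line per job, with the queue name in the last column. The function name count_queue_jobs is hypothetical.

```shell
# Count the jobs belonging to one queue in qstat output read from stdin.
# Assumption: two header lines, then one job per line, queue name last.
# Usage: qstat | count_queue_jobs short
count_queue_jobs() {
    awk -v q="$1" 'NR > 2 && $NF == q { n++ } END { print n + 0 }'
}
```

A result of 0 for the short queue corresponds to the situation described above, where the CE is free although the job was matched elsewhere.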

Note also that an efficient matchmaking process, with flexible and reliable identification of CEs that are really available and load balanced, is only the first step towards successful Grid execution. The matchmaking process must choose not only the proper CE but also the appropriate queue supported by that CE, which means that the JDL parameters must somehow reach the queue manager. As an example: if the job consumes more memory than the selected queue allows, the queue manager aborts the job during execution, even if the JDL specified other.MinPhysicalMemory>400.
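
For reference, the memory requirement quoted above would be added to the minimal JDL roughly as follows. This is a sketch only: the file name simple_mem.jdl is hypothetical, and the expression is the one cited in the text, placed in a Requirements attribute. As explained above, such a requirement only influences which CE is matched and does not by itself change the limits enforced by the queue manager.

```shell
# Write a variant of simple.jdl that adds the memory requirement.
# The output file name simple_mem.jdl is illustrative.
cat > simple_mem.jdl <<'EOF'
Executable = "simple.sh";
StdOutput = "simple.out";
StdError = "simple.err";
InputSandbox = {"simple.sh"};
OutputSandbox = {"simple.out", "simple.err"};
Requirements = other.MinPhysicalMemory > 400;
EOF
```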


luvisetto 2003-07-25