Starting a SALOME application in a batch manager
=================================================

This section explains how SALOME is used with the batch managers that operate clusters.
The objective is to run a SALOME application with a command script on a cluster, starting from a
SALOME session running on the user's personal machine. The script contains the tasks that the user
wants SALOME to execute. The most usual task is to start a YACS scheme.

The principle is as follows: starting from an initial SALOME session, a SALOME application is started
on a cluster through the batch manager. There are therefore two SALOME installations: one on the user's machine
and the other on the cluster. The user must have an account on the cluster with read/write access.
The connection protocol (rsh or ssh) from the user's machine to the cluster must also be correctly configured.

The remainder of this chapter describes the different steps of a run. Firstly, a SALOME application is started on
the user's machine using a CatalogResources.xml file that contains the description of the target batch
machine (see :ref:`catalogresources_batch`). The user then calls the SALOME Launcher service to run the application
on the batch machine, describing the input and output files of the SALOME application running in batch
and the Python script to be run (see :ref:`service_launcher`). This service then starts the SALOME
application defined in the CatalogResources.xml file on the batch machine and executes the Python
command file (see :ref:`salome_clusteur_batch`).

.. _catalogresources_batch:

Description of the cluster using the CatalogResources.xml file
----------------------------------------------------------------

The CatalogResources.xml file contains the description of the different distributed calculation
resources (machines) that SALOME can use to launch its containers. It can also contain the description
of clusters administered by batch managers.

The following is an example of the description of a cluster::

  <machine name = "clusteur1"
           hostname = "frontal.com"
           protocol = "ssh"
           userName = "user"
           type = "cluster"
           batch = "lsf"
           mpi = "prun"
           canLaunchBatchJobs = "true"
           appliPath = "/home/user/applis/batch_exemples"
           batchQueue = "mpi1G_5mn_4p"
           userCommands = "ulimit -s 8192"
           preReqFilePath = "/home/ribes/SALOME5/env-prerequis.sh"
           nbOfProcPerNode = "2"/>

The following is a description of the different fields used when launching a batch job:

- **name**: the name of the cluster used in SALOME commands. Warning: this name is not used to identify the cluster front end.
- **hostname**: the name of the cluster front end. It must be possible to reach this machine using the protocol
  defined in the file. This is the machine that will be used to start the batch session.
- **protocol**: sets the connection protocol between the user session and the cluster front end.
  The possible choices are rsh or ssh.
- **userName**: the user name on the cluster.
- **type**: identifies the machine as a single machine or a cluster managed by a batch manager. The possible choices are
  "single_machine" or "cluster". The "cluster" option must be chosen for the machine to be accepted as a cluster with a batch manager.
- **batch**: identifies the batch manager. The possible choices are pbs, lsf or sge.
- **mpi**: SALOME uses MPI to start the SALOME session and the containers on the different calculation nodes allocated
  by the batch manager. The possible choices are lam, mpich1, mpich2, openmpi, slurm and prun. Note that some
  batch managers replace the MPI launcher with their own launcher for resource management, which is the
  reason for the slurm and prun options.
- **appliPath**: contains the path of the SALOME application previously installed on the cluster.
- **canLaunchBatchJobs**: indicates that the cluster can be used to launch batch jobs. Must be set to "true"
  in order to use this cluster to launch a schema in batch mode.

There are two optional fields that can be useful depending on the configuration of the cluster:

- **batchQueue**: specifies the queue of the batch manager to be used.
- **userCommands**: inserts sh code to be executed when SALOME is started. This code is executed on all the nodes.

.. _service_launcher:

Using the Launcher service
--------------------------

The Launcher service is a CORBA server started by the SALOME kernel. Its interface is described in the
**SALOME_Launcher.idl** file of the kernel.

Its interface is as follows::

  interface SalomeLauncher
  {
    // Main methods
    long    createJob    (in Engines::JobParameters job_parameters) raises (SALOME::SALOME_Exception);
    void    launchJob    (in long job_id)                           raises (SALOME::SALOME_Exception);
    string  getJobState  (in long job_id)                           raises (SALOME::SALOME_Exception);
    string  getAssignedHostnames (in long job_id)                   raises (SALOME::SALOME_Exception); // Get names or ids of hosts assigned to the job
    void    getJobResults(in long job_id, in string directory)      raises (SALOME::SALOME_Exception);
    boolean getJobDumpState(in long job_id, in string directory)    raises (SALOME::SALOME_Exception);
    void    stopJob      (in long job_id)                           raises (SALOME::SALOME_Exception);
    void    removeJob    (in long job_id)                           raises (SALOME::SALOME_Exception);

    long    createJobWithFile(in string xmlJobFile, in string clusterName) raises (SALOME::SALOME_Exception);
    boolean testBatch        (in ResourceParameters params)                raises (SALOME::SALOME_Exception);

    // SALOME kernel service methods
    // ...

    // Observer and introspection methods
    void addObserver   (in Engines::SalomeLauncherObserver observer);
    void removeObserver(in Engines::SalomeLauncherObserver observer);
    Engines::JobsList      getJobsList();
    Engines::JobParameters getJobParameters(in long job_id) raises (SALOME::SALOME_Exception);

    // Save and load methods
    void loadJobs(in string jobs_file) raises (SALOME::SALOME_Exception);
    void saveJobs(in string jobs_file) raises (SALOME::SALOME_Exception);
  };

The **createJob** method creates the job itself and returns a **job** identifier that can be used with the
**launchJob**, **getJobState**, **stopJob** and **getJobResults** methods. The **launchJob** method
submits the job to the batch manager.

The following is an example using these methods::

  import salome
  salome.salome_init()

  # Get the SALOME Launcher service
  launcher = salome.naming_service.Resolve('/SalomeLauncher')

  # The Python script that will be launched on the cluster
  script = '/home/user/Dev/Install/BATCH_EXEMPLES_INSTALL/tests/test_Ex_Basic.py'

  # Define job parameters
  job_params = salome.JobParameters()
  job_params.job_name = "my_job"
  job_params.job_type = "python_salome"
  job_params.job_file = script
  job_params.in_files = []
  job_params.out_files = ['/scratch/user/applis/batch_exemples/filename']

  # Define resource parameters
  job_params.resource_required = salome.ResourceParameters()
  job_params.resource_required.name = "clusteur1"
  job_params.resource_required.nb_proc = 24

  # Create and submit the job
  jobId = launcher.createJob(job_params)
  launcher.launchJob(jobId)

The following is a description of the main parameters of the **JobParameters** structure:

- **job_type**: the type of the job to run (use "python_salome" to run a Python script in a SALOME session).
- **job_file**: the Python script that will be executed in the SALOME application on the cluster.
  This argument contains the script path **on** the local machine and **not on** the cluster.
- **in_files**: a list of files that will be copied into the run directory on the cluster (see the sketch after this list).
- **out_files**: a list of files that will be copied from the cluster onto the user's machine when the **getJobResults** method is called.
- **resource_required**: contains the description of the required machine. In this case, the cluster on which the
  application is to be launched is identified by its name.

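For example, a job that sends an input data file to the cluster and retrieves a result file could be described as
follows. This is a minimal sketch: the script and the file paths are purely illustrative; only the cluster name
"clusteur1" comes from the CatalogResources.xml example above::

  import salome
  salome.salome_init()

  launcher = salome.naming_service.Resolve('/SalomeLauncher')

  job_params = salome.JobParameters()
  job_params.job_name = "job_with_files"
  job_params.job_type = "python_salome"
  job_params.job_file = '/home/user/scripts/compute.py'        # path on the local machine
  job_params.in_files = ['/home/user/data/data.txt']           # copied to the run directory on the cluster
  job_params.out_files = ['/scratch/user/results/result.dat']  # retrieved later with getJobResults

  job_params.resource_required = salome.ResourceParameters()
  job_params.resource_required.name = "clusteur1"
  job_params.resource_required.nb_proc = 8

  jobId = launcher.createJob(job_params)
  launcher.launchJob(jobId)
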
The **getJobState** method is used to determine the state of the job. The following is an example of how this method is used::

  import time

  status = launcher.getJobState(jobId)
  print(jobId, status)
  while status != 'FINISHED':
      time.sleep(10)
      status = launcher.getJobState(jobId)
      print(jobId, status)

Finally, the **getJobResults** method must be used to retrieve the application results.
The following is an example of how to use this method::

  launcher.getJobResults(jobId, '/home/user/Results')

The second argument contains the directory in which the user wants to retrieve the results. In addition to the files
listed in **out_files**, the user automatically receives the logs of the SALOME application and of the different
containers that have been started.

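The list of jobs known to the Launcher can also be written to a file and restored later with the **saveJobs** and
**loadJobs** methods declared in the interface above. A minimal sketch (the file name is arbitrary)::

  # Save the jobs currently managed by the Launcher ...
  launcher.saveJobs('/home/user/saved_jobs')

  # ... and reload them later, for instance from a new SALOME session
  launcher.loadJobs('/home/user/saved_jobs')
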
.. _salome_clusteur_batch:

SALOME on the batch cluster
---------------------------

For the moment, SALOME does not provide a service to automatically install the platform from the user's personal machine.
SALOME (KERNEL + modules) and a SALOME application therefore have to be installed beforehand on the cluster.
In the example used in this documentation, the application is installed in the directory **/home/user/applis/batch_exemples**.

When the **launchJob** method is used, SALOME creates a run directory in $HOME/Batch/**run_date** on the cluster.
The various input files are copied into this directory.

SALOME constraints on batch managers
------------------------------------

SALOME needs a few capabilities that the batch manager must allow before SALOME applications can be run.

SALOME starts several **threads** for each CORBA server that is launched.
Some batch managers can limit the number of threads to a value that is too small, or can set the size
of the thread stack so that it is too large.
In our example, the user sets the size of the thread stack through the **userCommands** field of the CatalogResources.xml file.

SALOME starts processes in the session on the machines allocated by the batch manager; the batch manager must therefore authorise this.
Finally, SALOME relies on dynamic libraries and the **dlopen** function. The system must allow this.