RemoteJobs
BOINC provides an API for remotely submitting, monitoring, and controlling jobs on a BOINC server. The API lets you submit batches of jobs, which may contain a single job or many thousands of jobs.
The API uses Web RPCs (XML over HTTP). BOINC provides bindings in Python, PHP, and C++. These bindings differ slightly; they expose different details of the Web RPCs.
APIs for staging input files and fetching output files are described elsewhere.
There are various options for managing input files. For example, job-based file management maintains batch/file associations, allowing files to be deleted when they are no longer used. In this case the order of operations is:
- Create a batch (initially empty); this returns the batch ID.
- Stage input files, passing the batch ID.
- Submit jobs, passing the batch ID.
Alternatively, you can use the user file sandbox, in which users explicitly manage files on the server. In this case you can create the batch and submit jobs in a single API call.
Once you have submitted a batch, you can:
- Monitor the batch with query_batches(), query_batch(), or query_job().
- Abort the batch (if you see errors, or if enough jobs have finished) using abort_jobs() or abort_batch().
- Download output files.
- Retire the batch using retire_batch(). This tells the server to clean up the files and job records associated with the batch, and to mark the batch as "retired"; retired batches are normally not shown in the web interface.
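The monitor-then-retire cycle above can be sketched as a polling loop. This is a minimal sketch, not part of lib/boinc_submit.py: it assumes a `server` object that behaves like the Python binding's BOINC_SERVER (described below), whose query_batch() returns string-valued fields such as 'state', and it uses the state codes listed for the PHP query_batches() call (1 means in progress). The name wait_for_batch and its parameters are hypothetical.

```python
import time

# State code for "in progress", as listed for the PHP query_batches() call.
IN_PROGRESS = 1


def wait_for_batch(server, batch_id, poll_interval=60.0, max_polls=None,
                   sleep=time.sleep):
    """Poll query_batch() until the batch is no longer in progress.

    Returns the final query_batch() reply. Raises RuntimeError if the reply
    is an error dict, or TimeoutError after max_polls in-progress replies.
    """
    polls = 0
    while True:
        ret = server.query_batch(batch_id)
        if 'error' in ret:
            raise RuntimeError(ret['error'].get('error_msg', 'unknown error'))
        # Values in the reply are strings, e.g. {'state': '1', ...}.
        if int(ret['state']) != IN_PROGRESS:
            return ret
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError('batch %s still in progress' % batch_id)
        sleep(poll_interval)
```

Once wait_for_batch() returns, you would download the output files and then call retire_batch(batch_id). The sleep parameter exists only so the loop can be exercised without actually waiting.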
The file lib/boinc_submit.py contains a Python binding of some of the above RPCs. For examples of its use, see tools/submit_api_test.py.
The binding consists of several classes:
FILE_DESC(source=None, mode='local_staged', nbytes=None, md5=None)
Represents an input file. mode can be:
- local_staged: the file is already staged; source is the physical name.
- local: the file is on the server but not necessarily staged; source is the full path.
- semilocal: source is a URL. On job submission, it will be fetched and staged.
- remote: source is a URL; nbytes and md5 must be supplied. The file will be fetched by the client.
- sandbox: source is the sandbox file name.
- inline: source is the contents of the file.
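For remote mode, one way to obtain nbytes and md5 is to compute them from a local copy of the file. A minimal sketch using Python's hashlib, assuming the bytes on disk are exactly what the URL will serve; remote_file_info is a hypothetical helper, not part of the binding:

```python
import hashlib
import os


def remote_file_info(path, chunk_size=1 << 20):
    """Return (nbytes, md5_hex) for a file that will be served remotely.

    Assumes the local file at `path` is byte-identical to the remote copy.
    The file is read in chunks so large inputs need not fit in memory.
    """
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return os.path.getsize(path), md5.hexdigest()
```

For example (the URL is a placeholder): `nbytes, md5 = remote_file_info('input.dat')`, then `FILE_DESC('https://data.example.org/input.dat', mode='remote', nbytes=nbytes, md5=md5)`.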
JOB_DESC(files=[], name=None, rsc_fpops_est=None, command_line=None, input_template=None, output_template=None, priority=None)
Represents a job. files is a list of FILE_DESC objects. The other arguments specify the job's parameters and templates.
BATCH_DESC(app_name, jobs, batch_id=0, batch_name=None, expire_time=None, app_version_num=0, priority=None, allocation_priority=False)
Represents a batch of jobs for a given app.
- app_name: the application name.
- jobs: a list of JOB_DESC objects.
- batch_id: if the batch already exists, its ID.
- batch_name: if the batch doesn't exist, the name to give it (if not specified, a name of the form user:app:date will be assigned).
- expire_time: if specified, the batch will be retired after this time.
- app_version_num: use only this app version number.
- priority: give jobs this priority.
- allocation_priority: if True, assign priority based on the submitter's resource share.
BOINC_SERVER(url, authenticator, rpc_timeout=0)
Represents a BOINC project, together with the job submitter's account on that project.
- url: the project URL.
- authenticator: the submitter's authenticator.
- rpc_timeout: specify a long timeout if the server is overloaded.
RPC operations are member functions of BOINC_SERVER. All operations return a Python dictionary. In case of errors, this has the form
{'error': {
'error_num': '-1',
'error_msg': 'create_work failed: job is missing input template\n'
}}
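Because errors come back as an ordinary dictionary rather than as an exception, it can help to check every reply in one place. A small sketch; BoincRpcError and check_reply are hypothetical names, and only the 'error', 'error_num', and 'error_msg' keys shown above are assumed:

```python
class BoincRpcError(Exception):
    """Raised when a BOINC_SERVER call returns an {'error': {...}} dict."""

    def __init__(self, error_num, error_msg):
        super().__init__('%s (error %s)' % (error_msg.strip(), error_num))
        self.error_num = int(error_num)  # returned as a string, e.g. '-1'
        self.error_msg = error_msg


def check_reply(ret):
    """Return `ret` unchanged, or raise BoincRpcError if it reports an error."""
    err = ret.get('error')
    if err is not None:
        raise BoincRpcError(err.get('error_num', '0'), err.get('error_msg', ''))
    return ret
```

For example: `batch_id = check_reply(s.create_batch('uppercase'))['batch_id']`.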
The member functions of BOINC_SERVER are:
submit_batch(batch_desc)
Submit a batch of jobs described by a BATCH_DESC. On success, returns the batch ID:
{'batch_id': '441'}
create_batch(app_name, batch_name='', expire_time=0)
Create an empty batch and return its ID.
estimate_batch(batch_desc)
Estimate the time to complete a batch (described by a BATCH_DESC), given its job sizes and the computing power of the project. Returns:
{'seconds': '147396.27439935'}
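The estimate arrives as a string of seconds. A sketch that converts it to a datetime.timedelta for display; estimate_to_timedelta is a hypothetical helper:

```python
from datetime import timedelta


def estimate_to_timedelta(ret):
    """Convert an estimate_batch() reply such as {'seconds': '147396.27'}
    into a timedelta. The value is a string, so it is parsed with float()."""
    if 'error' in ret:
        raise RuntimeError(ret['error'].get('error_msg', 'unknown error'))
    return timedelta(seconds=float(ret['seconds']))
```

For the reply above this gives about 1 day, 16:56:36, i.e. roughly 41 hours.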
abort_batch(batch_id)
Abort the given batch (cancel its remaining jobs).
retire_batch(batch_id)
Retire the given batch (delete its output files).
query_batch(batch_id, get_cpu_time=False, get_job_details=False)
Query the stats of the given batch. Returns a structure of the form
{'app_name': 'uppercase',
'completion_time': '0',
'create_time': '1719535409',
'credit_canonical': '0',
'credit_estimate': '23.148148148148',
'est_completion_time': '0',
'expire_time': '0',
'fraction_done': '0',
'id': '441',
'job': {'canonical_instance_id': '0',
'id': '111956',
'name': 'uppercase_18578_1719535409.652051_0'},
'name': 'David Anderson:uppercase:Fri, 28 Jun 2024 00:43:29 +0000',
'nerror_jobs': '0',
'njobs': '1',
'state': '1'}
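All values in this reply are strings, and in this single-job example 'job' is a single dict. The sketch below normalizes 'job' to a list (assuming, though the example does not show it, that multi-job replies carry a list) and formats a summary using the state codes listed for the PHP query_batches() call. Both helper names are hypothetical.

```python
# State codes as listed for the PHP query_batches() call.
BATCH_STATES = {1: 'in progress', 2: 'completed', 3: 'aborted', 4: 'retired'}


def jobs_of(ret):
    """Return the 'job' entry of a query_batch() reply as a list.

    A single job appears as a dict (as in the example above); a list is
    assumed for several jobs, and an absent key is treated as no jobs.
    """
    jobs = ret.get('job', [])
    return [jobs] if isinstance(jobs, dict) else list(jobs)


def summarize_batch(ret):
    """Format a one-line summary of a query_batch() reply."""
    state = BATCH_STATES.get(int(ret['state']), 'unknown')
    return '%s (batch %s): %s, %d jobs, %d with errors, %.0f%% done' % (
        ret['name'], ret['id'], state, int(ret['njobs']),
        int(ret['nerror_jobs']), float(ret['fraction_done']) * 100)
```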
query_batches(get_cpu_time)
Return a list of the submitter's batches.
query_completed_job(job_name)
Query the details of a job. Example output:
{'completed_job': {'canonical_resultid': '112229',
'cpu_time': '9.40625',
'elapsed_time': '42.232193',
'error_mask': '0',
'exit_status': '0',
'stderr_out': '<core_client_version>7.24.1</core_client_version>\n'
'<![CDATA[\n'
'<stderr_txt>\n'
...
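error_mask is a bitmask whose bits are defined in db/boinc_db_types.h (see the C++ description of COMPLETED_JOB_DESC). A sketch that decodes it into flag names; the WU_ERROR_* values below are believed to match that header, but check them against your BOINC source tree. decode_error_mask is a hypothetical helper.

```python
# Workunit error bits; values believed to match db/boinc_db_types.h.
WU_ERROR_FLAGS = {
    0x01: 'WU_ERROR_COULDNT_SEND_RESULT',
    0x02: 'WU_ERROR_TOO_MANY_ERROR_RESULTS',
    0x04: 'WU_ERROR_TOO_MANY_SUCCESS_RESULTS',
    0x08: 'WU_ERROR_TOO_MANY_TOTAL_RESULTS',
    0x10: 'WU_ERROR_CANCELLED',
    0x20: 'WU_ERROR_NO_CANONICAL_RESULT',
}


def decode_error_mask(mask):
    """Return the names of the bits set in error_mask (a str or int)."""
    mask = int(mask)
    names = [name for bit, name in sorted(WU_ERROR_FLAGS.items()) if mask & bit]
    # Report any bits not in the table instead of silently dropping them.
    leftover = mask & ~sum(WU_ERROR_FLAGS)
    if leftover:
        names.append('unknown bits 0x%x' % leftover)
    return names
```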
abort_jobs(job_names)
Abort the given list of jobs.
get_job_counts()
Return counts of jobs in various states:
{'results_in_progress': '0',
'results_need_file_delete': '0',
'results_ready_to_send': '1',
'wus_need_assimilate': '3',
'wus_need_file_delete': '0',
'wus_need_validate': '0'}
The BOINC_SERVER class also provides functions for uploading input files. An example of uploading input files and submitting jobs:
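from boinc_submit import *
s = BOINC_SERVER('https://boinc.berkeley.edu/test/', 'auth')
# create an empty batch; its ID associates the uploaded files with it
ret = s.create_batch('uppercase')
if 'error' in ret:
    raise SystemExit(ret['error']['error_msg'])
batch_id = ret['batch_id']
# upload local files file_a and file_b; file_a.1 and file_b.1 are the names the jobs refer to
s.upload_files(['file_a', 'file_b'], ['file_a.1', 'file_b.1'], batch_id)
# one job per input file (FILE_DESC defaults to mode='local_staged')
j1 = JOB_DESC([FILE_DESC('file_a.1')])
j2 = JOB_DESC([FILE_DESC('file_b.1')])
b = BATCH_DESC('uppercase', [j1, j2], batch_id)
ret = s.submit_batch(b)
if 'error' in ret:
    raise SystemExit(ret['error']['error_msg'])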
The following functions are provided in the PHP file submit.inc, which is independent of other BOINC PHP code. The file html/user/submit_test.php has code to exercise and test these functions.
Submits a batch.
Arguments: a "request object" whose fields include
- project: the project URL
- authenticator: the user's authenticator
- app_name: the name of the application for which jobs are being submitted
- batch_name: a symbolic name for the batch. Must be unique. If omitted, a name of the form "batch_unixtime" will be used.
- input_template_filename: an optional input template file name.
- output_template_filename: an optional output template file name.
- job_params: optional job parameters (include the ones you want to specify):
  - rsc_disk_bound: limit on disk usage in bytes
  - rsc_fpops_est: estimated computing in FLOPs
  - rsc_fpops_bound: upper bound on computing (abort if exceeded)
  - rsc_memory_bound: maximum memory usage
  - delay_bound: maximum turnaround time; if exceeded, create another instance of the job.
- app_version_num: if present, pins the jobs to a particular app version number.
- allocation_priority: if present and true, prioritize jobs according to submitter allocations.
- priority N: if present, give jobs priority N.
- jobs: an array of job descriptors, each of which contains
  - name: optional; the workunit name. If supplied, must be unique. Default is appname_pid_time.
  - rsc_fpops_est: optional; an estimate of the FLOPs used by the job
  - command_line: optional; command-line arguments to the application
  - priority N: if present, give this job priority N
  - input_template: optional; the input template to use for this job, as an XML string.
  - output_template: optional; the output template to use for this job, as an XML string.
  - input_files: an array of input file descriptors, each of which contains
    - mode: "local", "semilocal", "local_staged", "inline", or "remote" (see below).
    - source: meaning depends on mode:
      - local: path on the BOINC server
      - semilocal: the file's URL
      - local_staged: physical name
      - inline: the file's contents
    - For "remote" mode, instead of "source" you must specify:
      - url: the file's URL
      - nbytes: file size
      - md5: the file's MD5 checksum
Result: a 2-element array containing
- The batch ID
- An error message (null if success)
Input files can be supplied in any of the following ways:
- local: the file is on the BOINC server and is not staged. It's specified by its full path.
- local_staged: the file has been staged on the BOINC server. It's specified by its physical name.
- semilocal: the file is on a data server that's accessible to the BOINC server but not necessarily to the outside world. The file is specified by its URL. It will be downloaded by the BOINC server during job submission, and served to clients from the BOINC server.
- inline: the file is included in the job submission request XML message. It will be served to clients from the BOINC server.
- remote: the file is on a data server other than the BOINC server, and will be served to clients from that data server. It's specified by the URL, the file size, and the file MD5.
The following mode has been proposed but is not implemented yet:
- sandbox: the file is in the user's [file sandbox](User file sandbox), and is specified by its name in the sandbox.
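The per-mode rules above can be summarized as a validation function. This Python sketch builds the descriptor as a dict with the same field names as the PHP request object (a PHP caller would set the same fields on a StdClass object). input_file_desc is a hypothetical name, and sandbox mode is rejected because it is not implemented.

```python
# Modes whose descriptor carries a "source" field, per the list above.
SOURCE_MODES = {'local', 'semilocal', 'local_staged', 'inline'}


def input_file_desc(mode, source=None, url=None, nbytes=None, md5=None):
    """Build an input file descriptor with exactly the fields its mode needs."""
    if mode in SOURCE_MODES:
        if source is None:
            raise ValueError("mode %r requires 'source'" % mode)
        return {'mode': mode, 'source': source}
    if mode == 'remote':
        # remote files use url/nbytes/md5 instead of source
        if url is None or nbytes is None or md5 is None:
            raise ValueError("mode 'remote' requires url, nbytes and md5")
        return {'mode': 'remote', 'url': url, 'nbytes': nbytes, 'md5': md5}
    # 'sandbox' is proposed but not implemented, so it is rejected too.
    raise ValueError('unsupported input file mode: %r' % mode)
```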
The following example submits a 10-job batch:
$req = new StdClass;
$req->project = "http://foo.bar.edu/test/";
$req->authenticator = "xxx";
$req->app_name = "uppercase";
$req->jobs = array();
// all jobs share the same staged input file
$f = new StdClass;
$f->mode = "local_staged";
$f->source = "filename.dat";
for ($i=10; $i<20; $i++) {
    // create a fresh object per job: PHP objects are assigned by handle,
    // so reusing one $job would make all 10 array entries the same job
    $job = new StdClass;
    $job->input_files = array($f);
    $job->rsc_fpops_est = $i*1e9;
    $job->command_line = "--t $i";
    $req->jobs[] = $job;
}
list($batch_id, $errmsg) = boinc_submit_batch($req);
if ($errmsg) {
echo "Error: $errmsg\n";
} else {
echo "Batch ID: $batch_id\n";
}
Note: this interface is inconsistent; it lets you do some things but not others. Let me know if you need additions.
Returns an estimate of the elapsed time required to complete a batch.
Arguments: same as boinc_submit_batch() (only relevant fields need to be populated).
Return value: a 2-element array containing
- The elapsed time estimate, in seconds
- An error message (null if success)
Returns a list of this user's batches, both in progress and complete.
Argument: a request object with elements
- project and authenticator: as above.
- get_cpu_time (optional): if nonzero, get CPU time of each batch
Result: a 2-element array. The first element is an array of batch descriptor objects, each with the following fields:
- id: batch ID
- state: values are
  - 1: in progress
  - 2: completed (all jobs either succeeded or had fatal errors)
  - 3: aborted
  - 4: retired
- name: the batch name
- app_name: the application name
- create_time: when the batch was submitted
- est_completion_time: current estimate of completion time
- njobs: the number of jobs in the batch
- fraction_done: the fraction of the batch that has been completed (0..1)
- nerror_jobs: the number of jobs that had fatal errors
- completion_time: when the batch was completed
- credit_estimate: BOINC's initial estimate of the credit that would be granted to complete the batch, including replication
- credit_canonical: the actual credit granted to canonical instances
- credit_total: the actual credit granted to all instances
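With these batch descriptor fields you can compute overall progress across your active batches. A sketch assuming each descriptor is available as a dict with the field names above (values may be strings, as in the Python binding's replies); weighting fraction_done by njobs is a design choice made here, not something the API does for you, and overall_progress is a hypothetical name.

```python
IN_PROGRESS = 1  # state code from the list above


def overall_progress(batches):
    """Job-weighted fraction done (0..1) over batches still in progress."""
    active = [b for b in batches if int(b['state']) == IN_PROGRESS]
    total_jobs = sum(int(b['njobs']) for b in active)
    if total_jobs == 0:
        return 0.0
    done = sum(int(b['njobs']) * float(b['fraction_done']) for b in active)
    return done / total_jobs
```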
Gets batch details.
Argument: a request object with elements
- project and authenticator: as above
- batch_id: specifies a batch.
- get_cpu_time (optional): if nonzero, get CPU time of batch. This includes all job instances, and doesn't include GPU time, so it may not be meaningful.
- get_job_details (optional): if nonzero, return job details (see below).
Result: a 2-element array. The first element is a batch descriptor object as described above, with an additional element:
- jobs: an array of job descriptor objects, each one containing
  - id: the database ID of the job's workunit record
  - canonical_instance_id: if the job has a canonical instance, its database ID
If get_job_details was set, the job descriptors also contain:
- status: "queued", "in_progress", "error", or "done".
- cpu_time: if done, the CPU time of canonical instance.
- exit_status: if error, the exit status of one of the error instances.
The order of job descriptors matches their order in the batch submission.
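With get_job_details set, the job descriptors can be tallied by status. A sketch assuming each descriptor is a dict with the id, status, and exit_status fields listed above; summarize_jobs is a hypothetical helper.

```python
from collections import Counter


def summarize_jobs(jobs):
    """Count jobs by status and collect exit statuses of failed jobs.

    Returns (Counter of status strings, {job id: exit_status} for errors).
    """
    counts = Counter(job['status'] for job in jobs)
    failed = {job['id']: job.get('exit_status')
              for job in jobs if job['status'] == 'error'}
    return counts, failed
```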
Gets job details.
Argument: a request object with elements:
- project and authenticator: as above
- job_id: specifies a job.
Result: a 2-element array. The first element is a job descriptor object with the following fields:
- instances: an array of job instance descriptors, each containing:
  - name: the instance's name
  - id: the ID of the corresponding result record
  - state: a string describing the instance's state (unsent, in progress, complete, etc.)
  - outfile: if the instance has finished, a list of output file descriptors, each containing
    - size: file size in bytes
Aborts a batch.
Argument: a request object with elements
- project and authenticator: as above
- batch_id: specifies a batch.
Result: an error message, null if successful
Delete server storage (files, DB records) associated with a batch.
Argument: a request object with elements
- project and authenticator: as above
- batch_id: specifies a batch.
Result: an error message, null if successful
Set the RPC timeout to $x seconds.
A C++ binding is available in lib/remote_submit.cpp. Include lib/remote_submit.h.
All functions return zero on success, else an error code as defined in lib/error_numbers.h.
Create a batch - a set of jobs, initially empty.
int create_batch(
const char* project_url,
const char* authenticator,
const char* batch_name,
const char* app_name,
double expire_time,
int &batch_id,
string& error_msg
);
- project_url: the project URL
- authenticator: the authenticator of the submitting user
- batch_name: a name for the batch. Must be unique over all batches.
- app_name: the name of an application on the BOINC server
- expire_time: if nonzero, the Unix time when the batch should be aborted and removed from the server, whether or not it's completed.
- batch_id: (out) the batch's database ID
- error_msg: (out) an error message if the operation failed
Get an estimate of the makespan of a (potential) batch.
int estimate_batch(
const char* project_url,
const char* authenticator,
char app_name[256],
vector<JOB> jobs,
double& est_makespan,
string& error_msg
);
- jobs: description of jobs; same as for submit_jobs() (see below).
- est_makespan: (out) the estimated makespan (time to completion).
Submit a set of jobs; place them in an existing batch, and make them runnable.
int submit_jobs(
const char* project_url,
const char* authenticator,
char app_name[256],
int batch_id,
vector<JOB> jobs,
string& error_msg,
int app_version_num = 0
);
int submit_jobs2(
const char* project_url,
const char* authenticator,
char app_name[256],
int batch_id,
vector<JOB> jobs,
string& error_msg,
int app_version_num,
JOB_PARAMS&
);
struct JOB {
char job_name[256];
string cmdline_args;
vector<INFILE> infiles;
};
struct JOB_PARAMS {
// 0 means unspecified for all params
double rsc_disk_bound;
double rsc_fpops_est;
double rsc_fpops_bound;
double rsc_memory_bound;
double delay_bound;
};
struct INFILE {
int mode;
// FILE_MODE_LOCAL_STAGED: file is already on BOINC server, and staged
// FILE_MODE_REMOTE: file is on a different server
// the following if LOCAL_STAGED
char physical_name[256];
// the following if REMOTE
char url[256];
double nbytes;
char md5[256];
};
- batch_id: ID of a previously created batch
For each job:
- job_name: must be unique over all jobs
- cmdline_args: command-line arguments
- infiles: list of input files
For each input file:
- physical_name: the physical name for the file. The file must already be staged.
Query the status of this user's batches.
int query_batches(
const char* project_url,
const char* authenticator,
vector<BATCH_STATUS>& batches,
string& error_msg
);
struct BATCH_STATUS {
int id;
char name[256]; // name of batch
char app_name[256];
int state; // see lib/common_defs.h
int njobs; // how many jobs in batch
int nerror_jobs; // how many jobs errored out
double fraction_done; // how much of batch is done (0..1)
double create_time; // when batch was created
double expire_time; // when it will expire
double est_completion_time; // estimated completion time
double completion_time; // if completed, actual completion time
double credit_estimate; // original estimate for credit
double credit_canonical; // if completed, granted credit
int parse(XML_PARSER&);
void print();
};
Return the detailed status of jobs in a given batch (can specify by either ID or name).
extern int query_batch(
const char* project_url,
const char* authenticator,
int batch_id,
const char batch_name[256],
vector<JOB_STATE>& jobs,
string& error_msg
);
struct JOB_STATE {
int id;
char name[256];
int canonical_instance_id; // if job completed successfully,
// the ID of the canonical instance
int n_outfiles; // number of output files
};
Abort a set of jobs.
extern int abort_jobs(
const char* project_url,
const char* authenticator,
vector<string> &job_names,
string& error_msg
);
Query a completed job.
extern int query_completed_job(
const char* project_url,
const char* authenticator,
const char* job_name,
COMPLETED_JOB_DESC&,
string& error_msg
);
struct COMPLETED_JOB_DESC {
int canonical_resultid;
int error_mask;
int error_resultid;
int exit_status;
double elapsed_time;
double cpu_time;
string stderr_out;
};
- canonical_resultid: database ID of the "canonical" instance of the job.
- error_mask: a bitmask of error conditions (see db/boinc_db_types.h)
- error_resultid: the database ID of a failed instance, if one exists
- exit_status: exit status of the failed instance
- elapsed_time: run time of the canonical instance
- cpu_time: CPU time of the canonical instance
- stderr_out: stderr output of the canonical or failed instance
"Retire" a batch. The server is then allowed to delete the batch's input and output files, and its database records.
extern int retire_batch(
const char* project_url,
const char* authenticator,
const char* batch_name,
string& error_msg
);
Change the expiration time of a batch.
extern int set_expire_time(
const char* project_url,
const char* authenticator,
const char* batch_name,
double expire_time,
string& error_msg
);
Ping the project's server; return zero if the server is up.
extern int ping_server(
const char* project_url,
string& error_msg
);
Return the status of the jobs in a given set of batches. This is used by the Condor interface; it's probably not useful outside of that.
extern int query_batch_set(
const char* project_url,
const char* authenticator,
vector<string> &batch_names,
QUERY_BATCH_SET_REPLY& reply,
string& error_msg
);
struct JOB_STATUS {
string job_name;
string status; // DONE, ERROR, or IN_PROGRESS
};
struct QUERY_BATCH_SET_REPLY {
vector<int> batch_sizes; // how many jobs in each of the queried batches
vector<JOB_STATUS> jobs; // the jobs, sequentially
};
At the bottom level, the APIs are accessed by sending a POST request, using HTTP or HTTPS, to PROJECT_URL/submit_rpc_handler.php. The inputs and outputs of each function are XML documents. The format of the request and reply XML documents can be inferred from user/submit_rpc_handler.php.
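A minimal sketch of the raw protocol using only Python's standard library. It assumes, based on how the bindings appear to work, that the request XML is sent as a form field named request and that errors come back as an <error> element containing error_num and error_msg (mirroring the Python binding's error dictionary); confirm both against submit_rpc_handler.php. The request XML itself must be built by the caller, since its format is defined by the handler. rpc_url, parse_reply, and post_rpc are hypothetical names.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def rpc_url(project_url):
    """PROJECT_URL/submit_rpc_handler.php, with or without a trailing slash."""
    return project_url.rstrip('/') + '/submit_rpc_handler.php'


def parse_reply(xml_text):
    """Parse a reply; raise if it is (or contains) an <error> element."""
    root = ET.fromstring(xml_text)
    err = root if root.tag == 'error' else root.find('.//error')
    if err is not None:
        raise RuntimeError('%s (error %s)' % (
            err.findtext('error_msg', '').strip(), err.findtext('error_num', '?')))
    return root


def post_rpc(project_url, request_xml, timeout=60):
    """POST request_xml as the (assumed) form field 'request'; parse the reply."""
    data = urllib.parse.urlencode({'request': request_xml}).encode('utf-8')
    with urllib.request.urlopen(rpc_url(project_url), data=data,
                                timeout=timeout) as resp:
        return parse_reply(resp.read().decode('utf-8'))
```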