Work is still in progress on a permanent fix for the issue, but we’re providing an update now since cancelling running and queued jobs is not something we take lightly.
Since the outage on Thursday night, SchedMD, the company that develops Slurm (the job scheduler in use at Pawsey), has identified the probable cause of the lock-up. Pawsey has not been the only Slurm 20.02 site to hit this issue: a fix was committed to the development branch 15 days ago (after our maintenance session) and will be included in the 20.02.4 release. The cause was a regression in task distribution, notably the condition
if ((job_res->cpus[n] < avail_cpus[n]) ||
which, for jobs requesting certain parameters, failed to increment a loop variable and instead sent the slurmctld into an infinite loop.
We have only seen one job trigger this since the upgrade to 20.02. Until the permanent fix is in place, we may implement some submission-time checks, or cancel jobs we know are likely to cause issues.