Work is still in progress on a permanent fix for the issue, but we’re providing an update now since cancelling running and queued jobs is not something we take lightly.
Since the outage on Thursday night, SchedMD, the company that develops Slurm (the job scheduler in use at Pawsey), has identified the probable cause of the lock-up. Pawsey has not been the only Slurm 20.02 site to hit this issue: a fix was committed to the development branch 15 days ago (after our maintenance session) and will be included in the 20.02.4 release. The cause was a regression in task distribution, notably the condition
if ((job_res->cpus[n] < avail_cpus[n]) ||
which, for jobs requesting certain parameters, failed to increment a loop variable and instead sent the slurmctld into an infinite loop.
We have only seen one job trigger this since the upgrade to 20.02. Until the permanent fix is in place, we may implement some submission-time checks, or cancel jobs we know are likely to cause issues.