SLURM Controller on Setonix Crashing

Incident Report for Pawsey Supercomputing Research Centre

Resolved

We are still waiting for a root cause from HPE, but the scheduler has been stable since Thursday.

Posted May 16, 2023 - 11:48 AWST

Monitoring

We are still waiting for a root cause for the failure, but the scheduler has been operational over the weekend and we will continue to monitor it.

Posted May 15, 2023 - 07:59 AWST

Update

We have adjusted the scheduler configuration and removed the reservation that was preventing jobs from running. We have lodged a case with HPE, which has in turn lodged a case with SchedMD. We will continue to monitor the scheduler and to press for a root cause to be identified.

Posted May 11, 2023 - 11:19 AWST

Investigating

The SLURM controller on Setonix appears to be crashing and restarting itself. We are raising this as a critical issue with HPE.

Posted May 10, 2023 - 14:17 AWST

This incident affected: Setonix (Slurm scheduler, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition).