SLURM Controller on Setonix Crashing
Resolved
We are still waiting for a root cause from HPE, but the scheduler has been stable since Thursday, May 11.
Posted May 16, 2023 - 11:48 AWST
Monitoring
We are still waiting for a root cause for the failure, but the scheduler has been operational over the weekend and we will continue to monitor it.
Posted May 15, 2023 - 07:59 AWST
Update
We have tweaked the scheduler configuration and removed the reservation that was preventing jobs from running. We have lodged a case with HPE, which has in turn lodged a case with SchedMD. We will continue to monitor the scheduler and press for a root cause to be identified.
Posted May 11, 2023 - 11:19 AWST
Investigating
The SLURM controller on Setonix appears to be crashing and restarting itself. We are raising this as a critical issue with HPE.
Posted May 10, 2023 - 14:17 AWST
This incident affected: Setonix (Slurm scheduler, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition).