Slurm is unavailable on Setonix

Incident Report for Pawsey Supercomputing Research Centre

Resolved

The Slurm controller appears to have recovered from its indigestion. We are awaiting advice from HPE on how to tune Slurm to avoid this situation in the future.

Posted Mar 22, 2024 - 08:15 AWST

Monitoring

We have been notified that the Slurm controller had reverted to using FirstJobId. We have set FirstJobId to 9999999 and recreated the controller pod.

Posted Mar 21, 2024 - 14:21 AWST

Update

The Slurm controller was restarted last night with increased resources. Pawsey staff continue to monitor the scheduler like a hawk.

Posted Mar 21, 2024 - 08:13 AWST

Investigating

Slurm is unavailable on Setonix
* Queries and jobs run via Slurm are unavailable
* Slurm is in a CrashLoopBackOff state

Posted Mar 20, 2024 - 21:48 AWST

This incident affected: Setonix (Data-mover nodes, Slurm scheduler).