Slurm is unavailable on Setonix
Resolved
The SLURM controller appears to have recovered from its indigestion. We are waiting for some advice from HPE about how to tune SLURM to avoid this situation in the future.
Posted Mar 22, 2024 - 08:15 AWST
Monitoring
We have been notified that the SLURM controller had reverted to using FirstJobId. We have updated it to 9999999 and recreated the controller pod.
Posted Mar 21, 2024 - 14:21 AWST
Update
The SLURM controller was restarted last night with increased resources. Pawsey staff continue to monitor the scheduler like a hawk.
Posted Mar 21, 2024 - 08:13 AWST
Investigating
Slurm is unavailable on Setonix
* Queries and job runs via Slurm are unavailable
* Slurm is in a CrashLoopBackOff state
Posted Mar 20, 2024 - 21:48 AWST
This incident affected: Setonix (Data-mover nodes, Slurm scheduler).