High number of Garrawarla compute nodes offline
Resolved
This incident was resolved earlier in the week.
Posted Sep 09, 2022 - 19:12 AWST
Monitoring
Two OSSs for astro were restarted, and the locked compute nodes were rebooted.
Posted Sep 05, 2022 - 10:15 AWST
Identified
Although the HA software is showing everything as "normal" on the storage servers, we're seeing a lot of failed connections to the OSTs mounted on astrofs-oss6. I suspect one or more parts of the storage system will need a reboot; once that's looking healthy, we can work on restoring availability of the compute nodes. Given we have a scheduled outage tomorrow for disk work, we may bring some parts of this work forward to reduce downtime.
I'm afraid there is no ETA for service returning to normal.
Posted Sep 05, 2022 - 08:32 AWST
Investigating
Automated health checks have marked a large number of nodes offline in Garrawarla, either due to timeouts or due to Slurm being unable to kill processes at the end of a job.

These generally indicate an issue with one or more filesystems, or with the InfiniBand network used for inter-node communications.

Pawsey staff will investigate during normal office hours.
Posted Sep 04, 2022 - 06:17 AWST
This incident affected: Garrawarla (Garrawarla compute nodes).