High number of Garrawarla compute nodes offline

Incident Report for Pawsey Supercomputing Research Centre

Resolved

This incident was resolved earlier in the week.

Posted Sep 09, 2022 - 19:12 AWST

Monitoring

Two OSSs for astro were restarted, and the locked compute nodes were rebooted.

Posted Sep 05, 2022 - 10:15 AWST

Identified

Although the HA software is showing everything as "normal" on the storage servers, we're seeing a lot of failed connections to the OSTs mounted on astrofs-oss6. I suspect one or more parts of the storage system will need a reboot, and once that's looking healthy we can work on restoring availability of compute nodes. Given we have a scheduled outage tomorrow for disk work, we may bring some parts of this work forwards to reduce downtime.
There is no ETA for service returning to normal, I'm afraid.

Posted Sep 05, 2022 - 08:32 AWST

Investigating

Automated health checks have marked a large number of nodes offline in Garrawarla, due either to timeouts or to Slurm being unable to kill processes at the end of a job.

These generally indicate an issue with one or more filesystems, or with the InfiniBand network used for inter-node communications.

Pawsey staff will investigate during normal office hours.

Posted Sep 04, 2022 - 06:17 AWST