Setonix DVS issue requires nodes to be rebooted
Resolved
Setonix has been operating normally since Wednesday, 30 November.
Posted Dec 01, 2022 - 09:00 AWST
Monitoring
Many of the impacted nodes have been returned to service and are running user jobs. Pawsey staff are validating the remaining nodes before returning them to service.
Posted Nov 30, 2022 - 17:24 AWST
Update
We have fenced off an additional management plane node and are migrating services. Once this has completed, we should start rebooting compute nodes.
Posted Nov 30, 2022 - 09:31 AWST
Identified
Last week, a hardware failure in the management plane for Setonix prevented automatic failover of some of the infrastructure services required to maintain the Setonix compute nodes. The resulting issue with DVS (used to make filesystems available to compute nodes) has now been resolved, but all affected compute nodes on Setonix will need to be rebooted.
Nodes have been set to drain and will be restarted once jobs on those nodes have completed.
Posted Nov 29, 2022 - 13:59 AWST
This incident affected: Setonix (Setonix work partition).