Disk array failure in CXFS cluster

Incident Report for Pawsey Supercomputing Research Centre

Resolved

The unscheduled outage of Pawsey storage systems has been resolved.

Posted Sep 10, 2020 - 17:45 AWST

Update

Our vendor has replaced more hardware and the drive pools are rebuilding. Once the rebuild is complete, Pawsey staff will reboot the array again and, if the hardware comes up cleanly, commence recovery of the filesystems.

Posted Sep 09, 2020 - 16:30 AWST

Update

Our vendor is still working on the issue, but progress has been hampered by a public holiday in the US.

Posted Sep 08, 2020 - 16:46 AWST

Update

Although the array enclosure has been replaced and successfully powered up, our vendor is studying diagnostic logs before making any further recommendations for service recovery.

Posted Sep 07, 2020 - 18:39 AWST

Update

The list of impacted services has been updated.

Posted Sep 07, 2020 - 15:39 AWST

Identified

One of the disk arrays that make up the HSM filesystem has just failed, and the filesystems that use it are unresponsive.
Pawsey staff are shutting down the nodes in the CXFS cluster to allow the on-site vendor engineer to repair the array.

At this stage we have no indication of damage, and no ETA for services being restored.

Posted Sep 07, 2020 - 15:23 AWST

This incident affected: Storage Systems (Banksia, Data Portal Systems, MWA Nodes, CASDA Nodes).