Disk array failure in cxfs cluster
Incident Report for Pawsey Supercomputing Centre
Resolved
The unscheduled outage of Pawsey storage systems has been resolved.
Posted Sep 10, 2020 - 17:45 AWST
Update
Our vendor has replaced more hardware and the drive pools are rebuilding. Once that's completed, Pawsey staff will reboot the array again and commence recovery of the filesystems if the hardware comes up cleanly.
Posted Sep 09, 2020 - 16:30 AWST
Update
Our vendor is still working on the issue, but progress has been hampered by a public holiday in the US.
Posted Sep 08, 2020 - 16:46 AWST
Update
Although the array enclosure has been replaced and successfully powered up, our vendor is studying diagnostic logs before making any further recommendations for service recovery.
Posted Sep 07, 2020 - 18:39 AWST
Update
Updated list of impacted services
Posted Sep 07, 2020 - 15:39 AWST
Identified
One of the disk arrays that compose the HSM filesystem has just failed, and the filesystems that use it are unresponsive.
Pawsey staff are shutting down the nodes in the cxfs cluster to allow the on-site vendor engineer to repair the array.

At this stage we have no indication of damage and no ETA for services being restored.
Posted Sep 07, 2020 - 15:23 AWST
This incident affected: Storage Systems (Data Portal Systems, Hierarchical Storage Management Systems, MWA Nodes, CASDA Nodes).