Power loss across multiple systems
Postmortem

This outage was caused by the total load on one of the switchboards (the one that provides static UPS-backed power to the supercompute cell) exceeding its adjustable trip limit during ‘burn in’ testing of the new Garrawarla cluster.

This has been remediated by switching the new Garrawarla compute nodes to an alternative supply. The adjustable limit will be increased during the upcoming September maintenance window; this will require a shutdown of all equipment connected to that distribution board, and of any services that depend on that equipment.

Posted Aug 20, 2020 - 15:36 AWST

Resolved
This incident has been resolved.
Posted Jul 08, 2020 - 10:44 AWST
Monitoring
The supercomputing clusters Magnus, Galaxy, Zeus, and Topaz, along with the Lustre filesystems, have returned to operation.
Posted Jul 03, 2020 - 16:41 AWST
Identified
One of the Lustre filesystems (/astro) isn't coming back online cleanly. A call has been lodged with the vendor and we're awaiting an update from them before proceeding. Supercompute services will remain offline until the filesystem issue is addressed.
Posted Jul 02, 2020 - 21:05 AWST
Update
All the affected services are fed via a single sub-board. CSIRO facilities staff are on their way to site to investigate the cause, and Pawsey staff will commence service recovery when given the all-clear.
Posted Jul 02, 2020 - 16:25 AWST
Update
We are continuing to investigate this issue.
Posted Jul 02, 2020 - 15:57 AWST
Investigating
We are investigating a possible power loss to some parts of the Pawsey infrastructure; it appears to affect services located in the supercompute cell. Staff are investigating and will update shortly. See also https://support.pawsey.org.au/documentation/display/US/I-2020-07-02-Pawsey
Posted Jul 02, 2020 - 15:34 AWST
This incident affected: ASKAP (ASKAP ingest nodes), Storage Systems (Banksia, CASDA Nodes), and Lustre filesystems (/askapbuffer filesystem).