Power loss across multiple systems

Incident Report for Pawsey Supercomputing Research Centre

Postmortem

This outage occurred when the total load on one of the switchboards, which provides static UPS-backed power to the supercompute cell, exceeded the board's adjustable trip limit during ‘burn-in’ testing of the new Garrawarla cluster.

This has been remediated by switching the new Garrawarla compute nodes to an alternative supply. The adjustable limit will be increased during the upcoming September maintenance window. That work will require a shutdown of all equipment connected to that distribution board, and of any services that depend on that equipment.
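
The failure mode described above is a simple capacity condition: the sum of the loads on a board must stay at or below the board's trip setting. A minimal Python sketch of that check follows. It is an illustration only, not Pawsey's or CSIRO's tooling. The names (`Load`, `would_trip`, `headroom`) and all ampere figures are hypothetical and are not measurements from this incident.

```python
from dataclasses import dataclass


@dataclass
class Load:
    name: str
    amps: float  # hypothetical steady-state draw, not a real measurement


def total_load(loads):
    """Sum the current drawn by every load on the board."""
    return sum(load.amps for load in loads)


def would_trip(loads, trip_limit_amps):
    """True if the combined load exceeds the board's adjustable trip limit."""
    return total_load(loads) > trip_limit_amps


def headroom(loads, trip_limit_amps):
    """Remaining capacity before the trip limit is reached (negative if over)."""
    return trip_limit_amps - total_load(loads)


# Hypothetical figures chosen only to show the before/after comparison.
TRIP_LIMIT_AMPS = 800.0
existing = [Load("existing supercompute cell equipment", 700.0)]
burn_in = Load("new cluster under burn-in", 150.0)

before = would_trip(existing, TRIP_LIMIT_AMPS)             # False: 700 <= 800
after = would_trip(existing + [burn_in], TRIP_LIMIT_AMPS)  # True: 850 > 800
```

Running a check like this before connecting new equipment shows whether the added draw fits within the remaining headroom, or whether the equipment needs an alternative supply or a higher trip setting first.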

Posted Aug 20, 2020 - 15:36 AWST

Resolved

This incident has been resolved.

Posted Jul 08, 2020 - 10:44 AWST

Monitoring

The Magnus, Galaxy, Zeus, and Topaz supercomputing clusters, along with the Lustre filesystems, have returned to operation.

Posted Jul 03, 2020 - 16:41 AWST

Identified

One of the Lustre filesystems (/astro) isn't coming back online cleanly. A call has been lodged with the vendor and we're awaiting an update from them before proceeding. Supercompute services will remain offline until the filesystem issue is addressed.

Posted Jul 02, 2020 - 21:05 AWST

Update

All the affected services are fed via a single sub-board. CSIRO facilities staff are on their way to site to investigate the cause, and Pawsey staff will commence service recovery when given the all-clear.

Posted Jul 02, 2020 - 16:25 AWST

Update

We are continuing to investigate this issue.

Posted Jul 02, 2020 - 15:57 AWST

Investigating

We are investigating a possible power loss to some parts of the Pawsey infrastructure; it appears to affect services located in the supercompute cell. Staff are investigating and will update shortly. See also https://support.pawsey.org.au/documentation/display/US/I-2020-07-02-Pawsey

Posted Jul 02, 2020 - 15:34 AWST

This incident affected: ASKAP (ASKAP ingest nodes), Storage Systems (Banksia), and Lustre filesystems (/askapbuffer filesystem).