This outage was caused by the total load on one of the switchboards, which provides static UPS-backed power to the supercompute cell, exceeding its adjustable trip limit during ‘burn in’ testing of the new Garrawarla cluster.
This has been remediated by switching the new Garrawarla compute nodes to an alternative supply. The adjustable limit will be increased during the upcoming September maintenance window, which will require a shutdown of all equipment connected to that distribution board and of any services that depend on that equipment.
Posted Aug 20, 2020 - 15:36 AWST
Resolved
This incident has been resolved.
Posted Jul 08, 2020 - 10:44 AWST
Monitoring
The supercomputing clusters Magnus, Galaxy, Zeus, and Topaz, along with the Lustre filesystems, have returned to operation.
Posted Jul 03, 2020 - 16:41 AWST
Identified
One of the Lustre filesystems (/astro) isn't coming back online cleanly. A call has been lodged with the vendor and we're awaiting an update from them before proceeding. Supercompute services will remain offline until the filesystem issue is addressed.
Posted Jul 02, 2020 - 21:05 AWST
Update
All the affected services are fed via a single sub-board. CSIRO facilities staff are on their way to site to investigate the cause, and Pawsey staff will commence service recovery when given the all-clear.