Reduced availability of some nodes
Incident Report for Pawsey Supercomputing Centre
Monitoring
The replacement switch has been fitted, and nodes are being brought back online.
Posted Dec 08, 2021 - 10:58 AWST
Identified
Due to a hardware failure in an InfiniBand switch, the 'nvlinkq' partition on Topaz and the 'copyq' partition on Garrawarla are currently unavailable. We are awaiting a replacement switch from the vendor but do not yet have an ETA.
Posted Nov 18, 2021 - 15:24 AWST
This incident affects: Garrawarla (Garrawarla compute nodes) and Topaz (GPU partition).