Reduced availability of some nodes

Resolved

Closing incident - all seems stable
Posted Dec 21, 2021 - 06:09 AWST

Monitoring

The replacement switch has been fitted and nodes are being brought back online.
Posted Dec 08, 2021 - 10:58 AWST

Identified

Due to a hardware failure in an InfiniBand switch, the 'nvlinkq' partition on Topaz and the 'copyq' partition on Garrawarla are currently unavailable. We are awaiting a replacement switch from the vendor but do not have an ETA at this time.
Posted Nov 18, 2021 - 15:24 AWST