High-speed interconnect issue
Incident Report for Pawsey Supercomputing Centre
Update
We are still awaiting the replacement part. We expect it to arrive in time for the scheduled maintenance on Tuesday.
Posted May 27, 2020 - 08:45 AWST
Update
Today's update from the vendor: they have requested an escalation from the Intel Support team to expedite the RMA process, but we still don't have any parts on site or an ETA for their arrival.
Posted May 14, 2020 - 13:37 AWST
Update
We are still waiting for replacement parts to be shipped to site from our vendor.
Posted May 12, 2020 - 12:00 AWST
Update
We are in contact with the vendor requesting replacement parts. At this stage, we have not been able to secure an ETA from the vendor.
Posted May 07, 2020 - 09:39 AWST
Identified
We've identified the failed component and the Omni-Path fabric is stable. The Omni-Path-connected Lustre clients are in recovery mode, and recovery may take some time to complete.
Posted May 06, 2020 - 08:58 AWST
Update
We are investigating an issue with the Omni-Path switch. This affects certain Zeus partitions. Please see the incident log page for more information: https://support.pawsey.org.au/documentation/display/US/I-2020-05-05-SC
Posted May 05, 2020 - 14:41 AWST
Investigating
We are investigating an issue with the Omni-Path switch.
Posted May 05, 2020 - 14:05 AWST
This incident affects: Zeus (Zeus login node, Zeus Compute nodes).