Identified - The internal login node on galaxy (galaxy-int), which is used for interactive Slurm jobs, has recently been logging PCIe bus errors. HPE/Cray staff are aware and will investigate further. Although compute blades can be removed for maintenance in a warm-swap operation, the internal login node (c0-0c0s0n2) sits on the same blade as the system's internal boot node (c0-0c0s0n1), so repair may be a more disruptive process.
Nov 07, 2022 - 06:28 AWST
The work to connect Phase 2 of the Setonix system and implement the upgraded Cassini cards has been completed and Phase 1 has now been returned to service.
While we understand researchers' frustration at the delay, the work was required to ensure Setonix was restored as a stable system.
Pawsey staff and our vendor partners continue to work on Phase 2 to bring that into service within the February timeframe.
Please note that the February maintenance window for Setonix will be extended to 2 days (Tuesday 7th and Wednesday 8th of February). This allows sufficient time to make the configuration changes to Phase 1 that are needed to bring Phase 2 online.
We appreciate your support. If you have any questions, please e-mail email@example.com.
Jan 25, 15:09 AWST
Our vendor is working on bringing Setonix back into service in its final configuration. The remaining services at Pawsey should all be operational. If you are having issues, please create a ticket via https://pawsey.org.au/support/
While Setonix was offline, Pawsey took the opportunity to ask HPE to replace all the network cards in Phase 1 with the upgraded Cassini cards, the same network cards that are in Phase 2. This is a significant improvement in networking capacity and capability: the cards offer higher bandwidth (200 Gb/s) and come with an updated libfabric that resolves many of the MPI issues encountered with Phase 1. HPE are rectifying issues with the high-speed fabric, which is taking longer than expected. This work is required to ensure Phase 1 is restored in a resilient fashion.
While we understand researchers' frustration at the delay, the work currently being done will ensure that Setonix is restored as a stable system and will allow us to bring Phase 2 into production as soon as possible.
Jan 19, 12:06 AWST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 13, 08:00 AWST
Over the weekend of 14/15 January, the high voltage supply switchgear for Pawsey will undergo its annual inspection and maintenance. This requires the supplies to be isolated, so ALL Pawsey systems will be shut down for the weekend. Systems will be shut down from Friday at 08:00, and startup will begin on Monday morning. We expect systems to be restored by the evening of Tuesday the 17th.
Nov 29, 08:25 AWST