Pawsey Supercomputing Research Centre
Identified - The internal login node on Galaxy (galaxy-int), which is used for interactive Slurm jobs, has been logging PCIe bus errors recently. HPE/Cray staff are aware and will investigate further. Although compute blades can be removed for maintenance in a warm-swap operation, the internal login node (c0-0c0s0n2) sits on the same blade as the system's internal boot node (c0-0c0s0n1), meaning repair may be a more disruptive process.
Nov 07, 2022 - 06:28 AWST
Setonix Major Outage
Login nodes Under Maintenance
Data-mover nodes Operational
Slurm scheduler Operational
Setonix work partition Operational
Setonix debug partition Degraded Performance
Setonix long partition Major Outage
Setonix copy partition Degraded Performance
Setonix askaprt partition Operational
Setonix highmem partition Operational
Lustre filesystems Operational (100.0% uptime over the past 90 days)
/scratch filesystem (new) Operational (100.0% uptime over the past 90 days)
/software filesystem Operational (100.0% uptime over the past 90 days)
/scratch filesystem (legacy) Operational (100.0% uptime over the past 90 days)
/group filesystem Operational (100.0% uptime over the past 90 days)
/astro filesystem Operational (100.0% uptime over the past 90 days)
/askapbuffer filesystem Operational (100.0% uptime over the past 90 days)
/askapingest filesystem Operational (100.0% uptime over the past 90 days)
Storage Systems Operational (100.0% uptime over the past 90 days)
Acacia Operational
Banksia Operational
Data Portal Systems Operational
MWA Nodes Operational
CASDA Nodes Operational
Ingest Operational
MWA ASVO Operational (100.0% uptime over the past 90 days)
Topaz Operational
GPU partition Operational
Topaz login nodes Operational
Slurm Controller (Topaz) Operational
ASKAP Operational
ASKAP ingest nodes Operational
ASKAP service nodes Operational
Garrawarla Operational
Garrawarla workq partition Operational
Garrawarla gpuq partition Operational
Garrawarla asvoq partition Operational
Garrawarla copyq partition Operational
Garrawarla login node Operational
Slurm Controller (Garrawarla) Operational
Nimbus Operational
Ceph storage Operational
Nimbus instances Operational
Nimbus dashboard Operational
Central Services Operational (100.0% uptime over the past 90 days)
Authentication and Authorization Operational
Service Desk Operational
License Server Operational
Application Portal Operational
Origin Operational
/home filesystem Operational
/pawsey filesystem Operational
Central Slurm Database Operational (100.0% uptime over the past 90 days)
Visualisation Services Operational
Remote Vis Operational
Nebula Operational
Visualisation Lab Operational
The Australian Biocommons Operational
Fgenesh++ Operational
Legacy Systems Operational
Galaxy Compute nodes Operational
Galaxy login nodes Operational
Slurm Controller (Galaxy) Operational
System metrics (live values not captured): Active Instances (Nimbus), Active Cores (Nimbus), Allocated Cores (Setonix work partition), Allocated Nodes (Setonix work partition), Allocated Nodes (Setonix askaprt partition), Allocated Nodes (Galaxy workq), Allocated Nodes (Garrawarla workq partition).
Past Incidents
Feb 3, 2023
Resolved - This incident has been resolved and the Pawsey Data Portal is available for use again. We apologise for this unplanned outage.
Feb 3, 17:41 AWST
Investigating - We are currently investigating this issue.
Feb 3, 13:54 AWST
Feb 2, 2023

No incidents reported.

Feb 1, 2023

No incidents reported.

Jan 31, 2023

No incidents reported.

Jan 30, 2023

No incidents reported.

Jan 29, 2023

No incidents reported.

Jan 28, 2023

No incidents reported.

Jan 27, 2023

No incidents reported.

Jan 26, 2023

No incidents reported.

Jan 25, 2023
Completed - The work to connect Phase 2 of the Setonix system and implement the upgraded Cassini cards has been completed and Phase 1 has now been returned to service.

While we understand researchers' frustration at the delay, the work was required to ensure Setonix was restored as a stable system.

Pawsey staff and our vendor partners continue to work on Phase 2 to bring that into service within the February timeframe.

Please note that the February maintenance window for Setonix will be extended to two days (Tuesday 7 and Wednesday 8 February) to allow sufficient time for the configuration changes to Phase 1 that are needed to bring Phase 2 online.

We appreciate your support; if you have any questions, please email help@pawsey.org.au.

Jan 25, 15:09 AWST
Update - Our vendor is working on bringing Setonix back into service in its final configuration. The remaining services at Pawsey should all be operational. If you are having issues, please create a ticket via https://pawsey.org.au/support/

While Setonix was offline, Pawsey took the opportunity to ask HPE to replace all the network cards in Phase 1 with the upgraded Cassini cards, the same network cards used in Phase 2. This is a significant improvement in networking capacity and capability, with higher bandwidth (200 Gb/s) and an updated libfabric that resolves many of the MPI issues encountered with Phase 1. HPE are rectifying issues with the high-speed fabric, which is taking longer than expected; this work is required to ensure Phase 1 is restored in a resilient fashion.

While we understand researchers' frustration at the delay, the work currently being done will ensure that Setonix is restored as a stable system, as well as allowing us to bring Phase 2 into production as soon as possible.

Jan 19, 12:06 AWST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 13, 08:00 AWST
Scheduled - Over the weekend of 14/15 January, the high voltage supply switchgear for Pawsey will undergo its annual inspection and maintenance. This requires the supplies to be isolated, so ALL Pawsey systems will be shut down for the weekend.
Systems will be shut down from 08:00 on Friday and started up again on Monday morning. We expect systems to be restored by the evening of Tuesday the 17th.

Nov 29, 08:25 AWST
Jan 24, 2023

No incidents reported.

Jan 23, 2023

No incidents reported.

Jan 22, 2023

No incidents reported.

Jan 21, 2023

No incidents reported.

Jan 20, 2023

No incidents reported.