Pawsey Supercomputing Research Centre
Update - CDU 1 failed again this morning.

HPE discovered that the expansion tank had no pressure. They have pumped the bladder tank to 18 PSI.

All cabinets, chassis and nodes are now powered up and nodes are booted.

We will pray.

Sep 03, 2025 - 11:37 AWST
Monitoring - HPE have powered back on the affected racks, and all nodes bar one have passed testing.

Pawsey will ask for a Post Incident Report from HPE and monitor the system overnight.

Sep 02, 2025 - 17:50 AWST
Investigating - CDU 1 has failed in Setonix, taking racks X1002, X1003 and X1004 offline along with all nodes in those racks.

Approximately half of the GPU capacity and one third of the CPU capacity of Setonix is unavailable.

HPE are onsite and investigating, but have no ETA on when the cooling unit will be restored to service.

Sep 02, 2025 - 17:10 AWST
Setonix Operational
Login nodes Operational
Data-mover nodes Operational
Slurm scheduler Operational
Setonix work partition Operational
Setonix debug partition Operational
Setonix long partition Operational
Setonix copy partition Operational
Setonix askaprt partition Operational
Setonix highmem partition Operational
Setonix gpu partition Operational
Setonix gpu high mem partition Operational
Setonix gpu debug partition Operational
Lustre filesystems Operational
/scratch filesystem Operational
/software filesystem Operational
/askapbuffer filesystem Operational
/askapingest filesystem Operational
Storage Systems Operational
Acacia Ingest Operational
Acacia MWA Operational
Acacia Projects Operational
Banksia Operational
Data Portal Systems Operational
CASDA Nodes Operational
MWA Nodes Operational
MWA ASVO Operational
ASKAP Operational
ASKAP ingest nodes Operational
ASKAP service nodes Operational
Central Services Operational
Authentication and Authorization Operational
Service Desk Operational
License Server Operational
Application Portal Operational
Origin Operational
/home filesystem Operational
/pawsey filesystem Operational
Central Slurm Database Operational
Documentation Operational
Visualisation Services Operational
Remote Vis Operational
Vis scheduler Operational
Setonix vis nodes Operational
Nebula vis nodes Operational
Visualisation Lab Operational
Reservation Operational
CARTA - Stable Operational
CARTA - Test Operational
Pawsey Remote VR Operational
The Australian Biocommons Operational
Fgenesh++ Operational
Nimbus - Legacy Operational
Ceph storage Operational
Nimbus instances Operational
Nimbus dashboard Operational
Nimbus APIs Operational
Sep 3, 2025

Unresolved incident: Cooling Unit (CDU) 1 has failed in Setonix.

Sep 2, 2025
Completed - All Pawsey services have been returned to production.

Please e-mail help@pawsey.org.au if you require assistance.

Sep 2, 16:00 AWST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 2, 08:00 AWST
Scheduled - Maintenance will be carried out on Pawsey systems on Tuesday 2 September to apply required patches and updates that improve the systems' stability, security, and performance. This maintenance window will also be used for other tasks that require downtime.

Planned work for this window includes:
• Implementation of Firewall-Border re-design (this will affect all network connectivity to all systems at Pawsey, including Acacia, Nimbus, Banksia, Nebula and Setonix).
• Setonix will have the latest bug and security fixes applied from SUSE Linux Enterprise Server 15 SP6.
• Setonix will have the latest "extended support" versions of HPE Slingshot Host Software and HPE User Services Software applied.
• Limits on the gpu partition on Setonix will be updated: max number of concurrent jobs per user will be set to 64 and max number of jobs submitted per user will be set to 1024 for all users.
• Limits on the gpu-highmem partition on Setonix will be updated: max number of concurrent jobs per user will be set to 8 and max number of jobs submitted per user will be set to 256 for all users.
• Banksia will have a ScoutAM upgrade.
• The Banksia tape library firmware will be updated.
• Patching of visualisation services will be undertaken.
• Patching of core Pawsey services will be undertaken.

Aug 25, 16:05 AWST
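
For users planning job arrays against the new gpu and gpu-highmem limits listed in the notice above, the following is a minimal sketch of how the two per-user caps interact. It assumes the submission limit behaves as a hard cap on accepted jobs and the concurrency limit as a cap on simultaneously running jobs; the dictionary layout and the function name `plan_submission` are illustrative only and are not part of any Pawsey or Slurm interface.

```python
# Per-user limits quoted in the maintenance notice.
# The structure and key names are hypothetical, chosen for illustration.
PARTITION_LIMITS = {
    "gpu": {"max_running": 64, "max_submitted": 1024},
    "gpu-highmem": {"max_running": 8, "max_submitted": 256},
}


def plan_submission(partition, jobs_wanted, jobs_already_submitted=0):
    """Estimate how a batch of submissions fares under the per-user caps.

    Returns (accepted, rejected, max_concurrent), assuming jobs beyond the
    submission cap are refused and at most max_running jobs run at once.
    """
    limits = PARTITION_LIMITS[partition]
    # Remaining room under the submission cap, never negative.
    room = max(limits["max_submitted"] - jobs_already_submitted, 0)
    accepted = min(jobs_wanted, room)
    rejected = jobs_wanted - accepted
    return accepted, rejected, limits["max_running"]


if __name__ == "__main__":
    print(plan_submission("gpu", 1500))
    print(plan_submission("gpu-highmem", 100, jobs_already_submitted=200))
```

Under these assumptions, a user wanting more jobs than the submission cap allows would need to submit in batches as earlier jobs finish.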
Sep 1, 2025

No incidents reported.

Aug 31, 2025

No incidents reported.

Aug 30, 2025

No incidents reported.

Aug 29, 2025

No incidents reported.

Aug 28, 2025

No incidents reported.

Aug 27, 2025

No incidents reported.

Aug 26, 2025
Completed - Pre-emptive maintenance has been completed.
* askap-ingest[01-18] now use HPE hardware-based ECC error correction and detection
* Five DIMMs have been replaced in the compute nodes

Aug 26, 09:41 AWST
Scheduled - The ASKAP ingest cluster will undergo proactive maintenance.
* ASKAP ingest compute nodes will switch from inline kernel error correction and detection to HPE hardware-based memory error correction and detection, as recommended by HPE
* Five DIMMs that could prove problematic in the future will be proactively replaced

Aug 26, 08:00 AWST
Aug 25, 2025

No incidents reported.

Aug 24, 2025

No incidents reported.

Aug 23, 2025

No incidents reported.

Aug 22, 2025

No incidents reported.

Aug 21, 2025

No incidents reported.

Aug 20, 2025

No incidents reported.