Pawsey Supercomputing Research Centre
Update - HPE has informed Pawsey that the recent issues with /scratch, communicated by Pawsey over the last couple of days, are caused by a Lustre bug related to the use of "fallocate". Pawsey has disabled "fallocate" on /scratch following tests performed on Setonix's Test and Development System (TDS). Workloads using fallocate may experience a minor performance hit, non-zero return codes, and files not pre-allocating on the filesystem. Most researchers should not experience any significant changes to their jobs, and given the issues we have had, some jobs may run faster. However, researchers are asked to contact Pawsey through the Help Desk if they encounter any issues.
Jul 09, 2025 - 15:50 AWST
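
For illustration, a minimal sketch in Python of how a workload might request pre-allocation and tolerate it being unavailable. The path and size below are placeholders, and exactly how a refused pre-allocation surfaces (an error code versus a slow fallback) depends on the interface used, so treating pre-allocation as optional is the safest assumption:

    # Minimal sketch, placeholder path and size: ask for pre-allocation
    # and continue if the filesystem refuses it. With fallocate disabled
    # on /scratch, the request may fail (e.g. EOPNOTSUPP) rather than
    # pre-allocate; treating that as non-fatal avoids spurious job failures.
    import errno
    import os

    fd = os.open("/scratch/myproject/output.dat",  # placeholder path
                 os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, 1 << 30)  # try to pre-allocate 1 GiB
    except OSError as e:
        if e.errno != errno.EOPNOTSUPP:
            raise
        # Pre-allocation unsupported or disabled: carry on without it.
        print("fallocate unavailable on this filesystem; continuing")
    finally:
        os.close(fd)
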
Update - Storage Node Failover
* A different storage node has been identified with the same high-load issues
* An HA failover will be performed on the node
* There will be a slight pause in access to /scratch

Jul 09, 2025 - 09:55 AWST
Monitoring - Failover has been completed by the engineers
* High-availability resources have been restored to the original nodes
* System IO load looks nominal
* Logs and dumps are being submitted to the vendor engineers for analysis

Jul 07, 2025 - 14:12 AWST
Identified - The issue has been identified and a fix is being implemented.
Jul 07, 2025 - 11:25 AWST
Investigating - An Object Storage Server (OSS) is showing the same symptoms as a previously detected server that was slow to respond.

The HPE engineers are going to fail over resources to the partner OSS to allow them to reboot the OSS. There will be a brief pause in access to /scratch as clients reconnect.

Please be aware that the /scratch filesystem is becoming increasingly full. We would appreciate the assistance of researchers in removing unused files from the filesystem to ensure it remains accessible to all.
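
As a rough sketch (not an official Pawsey tool; the project directory is a placeholder), the snippet below lists files untouched for more than 30 days so candidates for removal can be reviewed; it prints paths only and deletes nothing:

    # Rough sketch: print files under a placeholder project directory
    # whose modification time is older than 30 days. Review the output
    # yourself; this script never deletes anything.
    import os
    import time

    root = "/scratch/myproject"        # placeholder: your project directory
    cutoff = time.time() - 30 * 86400  # 30 days, in seconds

    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    print(path)
            except OSError:
                pass  # unreadable or vanished mid-walk; skip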

Jul 07, 2025 - 10:50 AWST
Setonix Operational
Login nodes Operational
Data-mover nodes Operational
Slurm scheduler Operational
Setonix work partition Operational
Setonix debug partition Operational
Setonix long partition Operational
Setonix copy partition Operational
Setonix askaprt partition Operational
Setonix highmem partition Operational
Setonix gpu partition Operational
Setonix gpu high mem partition Operational
Setonix gpu debug partition Operational
Lustre filesystems Operational
/scratch filesystem Operational
/software filesystem Operational
/askapbuffer filesystem Operational
/askapingest filesystem Operational
Storage Systems Operational
Acacia - Projects Operational
Banksia Operational
Data Portal Systems Operational
MWA Nodes Operational
CASDA Nodes Operational
Acacia - Ingest Operational
MWA ASVO Operational
ASKAP Operational
ASKAP ingest nodes Operational
ASKAP service nodes Operational
Central Services Operational
Authentication and Authorization Operational
Service Desk Operational
License Server Operational
Application Portal Operational
Origin Operational
/home filesystem Operational
/pawsey filesystem Operational
Central Slurm Database Operational
Documentation Operational
Visualisation Services Operational
Remote Vis Operational
Vis scheduler Operational
Setonix vis nodes Operational
Nebula vis nodes Operational
Visualisation Lab Operational
Reservation Operational
CARTA - Stable Operational
CARTA - Test Operational
Pawsey Remote VR Operational
The Australian Biocommons Operational
Fgenesh++ Operational
Nimbus - Legacy Operational
Ceph storage Operational
Nimbus instances Operational
Nimbus dashboard Operational
Nimbus APIs Operational

Scheduled Maintenance

Acacia Ingest maintenance Jul 15, 2025 09:00-17:00 AWST

We will be upgrading the Ceph cluster to a new version. Services will remain available for the whole duration, although the system will be considered at risk.
Posted on Jul 08, 2025 - 12:22 AWST

Banksia Spectralogic Annual Tape Library Cleans - Reduced capacity Jul 21, 2025 08:00 - Jul 25, 2025 17:00 AWST

We will be undertaking non-disruptive Banksia tape library maintenance at this time, so available tape copies will be reduced from two to one.
Posted on Jul 04, 2025 - 07:58 AWST
Jul 9, 2025

Unresolved incident: /scratch Object Storage Server (OSS) - Failover / reboot.

Jul 8, 2025

No incidents reported.

Jul 7, 2025
Jul 6, 2025

No incidents reported.

Jul 5, 2025

No incidents reported.

Jul 4, 2025
Resolved - Resources have been rebalanced to their optimal configuration. HPE have collected logs from the filesystem and will provide Pawsey with a root cause analysis in due course.
Jul 4, 08:21 AWST
Monitoring - Failover has been completed
* The resource has been restored to its nominal high-availability pair
* It will be monitored

Jul 3, 21:19 AWST
Investigating - It appears we have a slow Object Storage Server (OSS) serving part of /scratch. The HPE engineers are going to fail over resources to the partner OSS to allow them to reboot the OSS. There will be a brief pause in access to /scratch as clients reconnect.

Please be aware that the /scratch filesystem is becoming increasingly full. We would appreciate the assistance of researchers in removing unused files from the filesystem to ensure it is accessible to all.

Jul 3, 14:28 AWST
Jul 3, 2025
Jul 2, 2025
Resolved - This incident has been resolved.
Jul 2, 11:19 AWST
Monitoring - Automatic QoS changes for accounts that have exceeded allocation are now running again. Staff have identified a few users who may experience 'InvalidQOS' for jobs and will work on a resolution shortly.
Jul 2, 08:19 AWST
Investigating - There was an issue with the overnight quarterly automatic reset of Slurm priorities for projects that had exceeded their allocation in Q2. Staff are working on a resolution; jobs can continue to be submitted and will run as resources become available. Once we have resolved the issue, any jobs which have the incorrect QoS will be automatically updated.
Jul 1, 16:34 AWST
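
For researchers wanting to check whether their own jobs were affected, a hedged sketch below lists each queued job's QoS and pending reason using standard squeue format fields (%i job id, %q QoS, %r reason); jobs showing the 'InvalidQOS' reason will be updated automatically once the issue is resolved:

    # Sketch: list the current user's jobs with their QoS and pending
    # reason (squeue fields: %i=jobid, %q=qos, %r=reason) to spot jobs
    # held with the 'InvalidQOS' reason described above.
    import getpass
    import subprocess

    result = subprocess.run(
        ["squeue", "-u", getpass.getuser(), "-h", "-o", "%i %q %r"],
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.splitlines():
        if not line.strip():
            continue
        jobid, qos, reason = line.split(maxsplit=2)
        note = "  <-- affected; will be updated" if reason == "InvalidQOS" else ""
        print(f"job {jobid}: qos={qos}, reason={reason}{note}")
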
Jul 1, 2025
Resolved - The issue with Staging is resolved. Staging is continuing.
Jul 1, 12:43 AWST
Monitoring - A fix has been implemented and we are monitoring the results.
Jul 1, 12:40 AWST
Identified - The root cause appears to have been identified by our vendor and we need to take Banksia offline immediately to implement the required fix.
Jul 1, 10:59 AWST
Investigating - Currently there is an issue affecting file staging; only a portion of requests are succeeding. It is recommended not to use Banksia at this time.
Jul 1, 10:58 AWST
Jun 30, 2025

No incidents reported.

Jun 29, 2025

No incidents reported.

Jun 28, 2025

No incidents reported.

Jun 27, 2025

No incidents reported.

Jun 26, 2025

No incidents reported.

Jun 25, 2025
Completed - The scheduled maintenance has been completed.
Jun 25, 17:00 AWST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jun 25, 08:00 AWST
Scheduled - Acacia Projects and Acacia Ingest will continue their move to newer operating systems via a rolling upgrade. There is no planned service outage, but the availability will be considered at risk.
Jun 24, 16:35 AWST
Completed - Setonix was handed back to Pawsey at 8 AM this morning. We (Pawsey) have rebooted the entire system and run our usual battery of tests. As usual, a small number of nodes (including visualisation nodes) have been put into reservations so hardware issues can be triaged and resolved.

Setonix has been returned to service.

The Pawsey team will issue separate communication about the current configuration and the use of NVMe devices on Setonix GPU nodes.

Acacia is still undergoing a rolling upgrade of its operating system, which is being tracked on a separate maintenance page (https://status.pawsey.org.au/incidents/39nvt00xyhm4)

The next scheduled maintenance will be 5th August 2025. Setonix will be upgraded to the next extended support release of the Cray Operating System, which is based on SLES 15 SP6.

Thank you to everyone involved in maintenance for their hard work.

If you have any questions, please contact help@pawsey.org.au. Ask nicely.

Jun 25, 13:14 AWST
Update - Apologies about the "AWSET" typo; it is where my brain currently is ....
Jun 24, 16:09 AWST
Update - Banksia has been returned to service (except for Kafka notifications, which will be returned to service later this afternoon). The ScoutAM update, storage controller firmware update, and tape library firmware update have been completed. A storage controller will need to be replaced, but this will be done live.

Acacia (MWA) intrusive testing work is complete.

Acacia (Projects) and Acacia (Ingest) operating system upgrades are 1/3 complete. The service has been available throughout, but the availability is considered at risk. The upgrades will continue tomorrow, and a seperate maintenance page will be opened.

Operating System updates to the SLURM database daemon, patching of visualisation services and patching of core Pawsey services are complete.

ASKAP Ingest has also been returned to service.

HPE continue to work on Setonix. We hope to have an update at 5 PM (AWST). However, we still expect HPE to hand Setonix back to Pawsey sometime tomorrow.

CARTA and Setonix visualisation nodes are dependent on HPE handing Setonix back to Pawsey before they can be put back into production.

Jun 24, 16:06 AWST
Update - Patching of visualisation services is complete. CARTA and Setonix visualisation nodes are waiting on HPE handing Setonix back before they can be returned to service.

Patching of core Pawsey services is complete.

Jun 24, 15:22 AWST
Update - The Setonix management control plane has been upgraded to HPCM 1.13. The system has been handed over to the local HPE team to perform remediation work arising from the extended work during the last maintenance (May). The system will be handed back to the remote team this evening to complete the management control plane upgrade.

Upgrades on other systems have commenced.

Jun 24, 08:47 AWST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jun 23, 16:00 AWST
Update - The Setonix maintenance has been brought forward to 4 PM to allow Pawsey staff to safely shut down services before handing the system over to HPE.
Jun 22, 13:14 AWST
Update - Acacia (Projects) and Acacia (Ingest) will move to newer operating systems via a rolling upgrade. There is no planned service outage, but the availability will be considered at risk.
Jun 19, 17:53 AWST
Scheduled - Maintenance will be carried out on Setonix starting at 5 PM, Monday June 23rd 2025. This is to support the HPE engineers performing the HPE Performance Cluster Manager (HPCM) upgrade, who are in a different time zone to Perth.

Maintenance will be carried out on all other Pawsey systems on Tuesday 24th June to apply required patches and updates that improve system stability, security, and performance. This maintenance window will also be used to undertake other tasks that require downtime.

Planned work for this window includes:
• Update of NVMe SLURM gres configuration on Setonix GPU nodes
• Banksia ScoutAM update
• Banksia storage controller firmware update
• Banksia tape library firmware update
• Acacia (MWA) will be rebooted to test network changes "stick"
• Acacia (Projects) and Acacia (Ingest) will move to newer operating systems via a rolling upgrade. There is no planned service outage, but the availability will be considered at risk.
• Operating System updates to the SLURM database daemon
• Patching visualisation services
• Patching of core Pawsey services

We expect to be able to bring all services (except Setonix) back by the end of the day. Setonix is scheduled to be handed back to Pawsey early Wednesday morning (June 25th 2025).

The next scheduled maintenance will be 5th August 2025. Setonix will be upgraded to the next extended support release of the Cray Operating System, which is based on SLES 15 SP6.

If you have any questions, please contact help@pawsey.org.au.

Jun 17, 14:18 AWST