Pawsey Supercomputing Centre
Update - Our administrators are attempting to free space on the critically full OSTs by migrating some large files to less full OSTs.
May 26, 15:53 AWST
Investigating - Garrawarla has begun draining nodes in response to the /astro filesystem nearing critical OST usage.
May 26, 15:12 AWST
Identified - One of the arrays in /askapbuffer (askapfs-array06) restarted one of its controller cards on Friday. This has been escalated to the vendor, who are investigating the error. A replacement for a disk that failed on Sunday has not yet been shipped, as the vendor is awaiting a response from their escalation team.
Whilst there has been no visible outage to users, we are raising this incident in case we need to perform maintenance beyond normal hot-swappable parts.

May 16, 16:10 AWST
Magnus Operational
Magnus Compute nodes Operational
Magnus login nodes Operational
Slurm Controller (Magnus) Operational
Galaxy Operational
Galaxy Compute nodes Operational
Galaxy login nodes Operational
Slurm Controller (Galaxy) Operational
Topaz Operational
GPU partition Operational
Topaz login nodes Operational
Slurm Controller (Topaz) Operational
Zeus Operational
Zeus Compute nodes Operational
Zeus login node Operational
Data Mover nodes (CopyQ) Operational
Slurm Controller (Zeus) Operational
Lustre filesystems Partial Outage (97.68 % uptime over the past 90 days)
/scratch filesystem Operational (100.0 % uptime over the past 90 days)
/group filesystem Operational (100.0 % uptime over the past 90 days)
/astro filesystem Partial Outage (88.44 % uptime over the past 90 days)
/askapbuffer filesystem Operational (99.99 % uptime over the past 90 days)
/askapingest filesystem Operational (100.0 % uptime over the past 90 days)
ASKAP Operational
ASKAP ingest nodes Operational
ASKAP service nodes Operational
Garrawarla Partial Outage
Garrawarla compute nodes Partial Outage
Garrawarla login node Operational
Slurm Controller (Garrawarla) Operational
Nimbus Operational
Ceph storage Operational
Nimbus instances Operational
Nimbus dashboard Operational
Storage Systems Operational
Data Portal Systems Operational
Banksia Operational
MWA Nodes Operational
CASDA Nodes Operational
Long-term online storage Operational
Acacia Operational
Ingest Operational
Central Services Operational
Authentication and Authorization Operational
Service Desk Operational
License Server Operational
Application Portal Operational
Origin Operational
/home filesystem Operational
/pawsey filesystem Operational
Central Slurm Database Operational
Visualisation Services Operational
Remote Vis Operational
Nebula Operational
Visualisation Lab Operational
The Australian Biocommons Operational
Fgenesh++ Operational
[Metrics charts: Active Instances (Nimbus), Active Cores (Nimbus), Allocated Nodes (Magnus workq), Allocated Nodes (Galaxy workq)]
Past Incidents
May 28, 2022

No incidents reported today.

May 27, 2022

No incidents reported.

May 26, 2022

Unresolved incident: Garrawarla draining nodes due to OST usage.

May 25, 2022
Resolved - The extreme slowness issues with Banksia last week now appear to be resolved, after the vendor worked to clean up the system and implement a new software patch.
May 25, 11:25 AWST
Update - Banksia queues are now idle and the system is ready for use again. The vendor has remedied the short reads and dealt with one tape in particular that was causing a queue blockage. We will monitor for a few days and, if all is operating optimally, we will look to close this incident.
May 23, 09:53 AWST
Update - Banksia has been patched and files are once again staging from tape; the stage queue length has subsequently halved. The vendor is completing further cleanup and verification work on the filesystem cache, which may slow things down over the next day or so, but the vast majority of queued stage requests are now running. We will monitor this today and over the weekend, and expect the system to settle back to normal by Monday.
May 20, 09:59 AWST
Update - The vendor is working on a new software release (2.6.3) for Banksia to remedy the current issues that meant the tape queue scheduler was idled. This release is currently undergoing testing before it is scheduled to be rolled into production; more details on the release date will be provided once testing is completed in the next few days. Online files remain available.
May 19, 09:45 AWST
Update - Overnight the vendor has needed to idle the staging scheduler so presently files requested from tape won’t be retrieved and will be queued, however files already online are still available. The scheduler will be resumed after the next patch, due out shortly.
May 18, 08:49 AWST
Update - Banksia load has reduced further and the system state is looking better today. A volume of files still remains to be recalled from tape, but it is a manageable number.
May 17, 09:46 AWST
Update - Pawsey and the vendor are continuing to monitor for further issues; the system is still very busy post-cleanup and is expected to remain so for some days.
May 16, 09:04 AWST
Monitoring - The Vendor has resumed the jobs to stage files online, but there is quite a long tail of items to process and retrieve, mostly Pawsey Data Portal file requests. So the slow state may persist for the time being. Another update will be provided tomorrow.
May 14, 08:50 AWST
Update - The vendor has completed their initial work. The Staging Scheduler has been stopped so the system can process the backlog; this is unlikely to complete today.
May 13, 11:56 AWST
Update - Banksia has slowed down further.
We have met with the vendor and have a robust plan to address all known issues.

May 13, 11:41 AWST
Update - Banksia is under high load and operating extremely slowly, so although it is online, it is essentially unavailable.
The vendor is investigating and we will provide an update later today.

May 13, 09:53 AWST
Investigating - The Banksia system continues to experience intermittent periods of high load, nodes being ejected from the cluster and issues that affect the staging of some files from tape.

Work continues to further tune the system to ensure all migrated files stage successfully.

Please email us about any files you require if they do not stage back after a day.

May 11, 13:38 AWST
May 24, 2022

No incidents reported.

May 23, 2022
May 22, 2022

No incidents reported.

May 21, 2022

No incidents reported.

May 20, 2022
Resolved - Both Magnus and Galaxy were restored to service mid-afternoon.

It appears that the variable-speed drives that control some of the pumps suffered a fault. This has been escalated to the controls contractor to determine the root cause; however, the drives reset successfully, allowing us to restore services.

May 20, 17:36 AWST
Investigating - The Cray XC systems at Pawsey are currently offline due to a pressure drop on the 22-degree cooling loop.
May 20, 07:15 AWST
May 19, 2022
May 18, 2022
May 17, 2022
May 16, 2022

Unresolved incident: /askapbuffer 'At Risk' of outage.

May 15, 2022

No incidents reported.

May 14, 2022