/scratch Object Storage Server (OSS) - Failover / reboot

Update

HPE has informed Pawsey that the recent issues with /scratch, communicated by Pawsey over the last couple of days, are caused by a Lustre bug related to the use of "fallocate". Pawsey has disabled "fallocate" on /scratch following tests performed on Setonix's Test and Development System (TDS). Workloads using fallocate may experience a minor performance hit, non-zero return codes, and files not being pre-allocated on the filesystem. Most researchers should not notice any significant change to their jobs, and given the issues we have had, some jobs may even run faster. Researchers are asked to contact Pawsey through the Help Desk if they do encounter any issues.
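
For reference, a minimal sketch (not Pawsey-provided code; the file path and allocation size are hypothetical) of how a workload that calls the Linux fallocate(2) syscall directly might treat the resulting non-zero return code as non-fatal when fallocate is disabled on the filesystem:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical path; substitute a file under /scratch. */
        int fd = open("testfile", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* Try to pre-allocate 1 GiB. With fallocate disabled on the
         * filesystem, this call may fail with errno == EOPNOTSUPP. */
        if (fallocate(fd, 0, 0, (off_t)1024 * 1024 * 1024) != 0) {
            if (errno == EOPNOTSUPP) {
                /* Not fatal: continue without pre-allocation. */
                fprintf(stderr, "fallocate unsupported; continuing without pre-allocation\n");
            } else {
                perror("fallocate");
                close(fd);
                return EXIT_FAILURE;
            }
        }

        close(fd);
        return EXIT_SUCCESS;
    }

Workloads that cannot be modified in this way and fail hard on the return code should contact the Help Desk.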
Posted Jul 09, 2025 - 15:50 AWST

Update

Storage Node Failover
* A different storage node has been identified with the same high load issues
* An HA failover will be performed on the node
* There will be a brief pause in access to /scratch
Posted Jul 09, 2025 - 09:55 AWST

Monitoring

The failover has been completed by the engineers
* High-availability resources have been restored to the original nodes
* System IO load looks nominal
* Logs and dumps are being submitted to the vendor engineers for analysis
Posted Jul 07, 2025 - 14:12 AWST

Identified

The issue has been identified and a fix is being implemented.
Posted Jul 07, 2025 - 11:25 AWST

Investigating

An Object Storage Server (OSS) is showing the same symptoms as a previously detected server: it is slow to respond.

The HPE engineers are going to fail over resources to the partner OSS so that the affected OSS can be rebooted. There will be a brief pause in access to /scratch as clients reconnect.

Please be aware that the /scratch filesystem is becoming increasingly full. We would appreciate the assistance of researchers in removing unused files from the filesystem to ensure it remains accessible to all; a sketch for identifying candidate files is shown below.
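
As an illustration only (this is not an official Pawsey tool; the 30-day threshold and the use of access time are assumptions), a short C sketch that walks a directory tree and prints regular files not accessed within a given number of days, as candidates for cleanup:

    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define DAYS_UNUSED 30  /* hypothetical threshold */

    static time_t cutoff;

    /* Report regular files whose last access is older than the cutoff. */
    static int check_entry(const char *path, const struct stat *sb,
                           int typeflag, struct FTW *ftwbuf)
    {
        (void)ftwbuf;
        if (typeflag == FTW_F && sb->st_atime < cutoff)
            printf("%s\n", path);
        return 0;  /* keep walking */
    }

    int main(int argc, char **argv)
    {
        /* Pass your /scratch directory as the first argument. */
        const char *root = (argc > 1) ? argv[1] : ".";
        cutoff = time(NULL) - (time_t)DAYS_UNUSED * 24 * 60 * 60;
        /* Walk the tree without following symlinks or crossing mounts. */
        if (nftw(root, check_entry, 16, FTW_PHYS | FTW_MOUNT) != 0) {
            perror("nftw");
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }

This only lists candidates; please review before deleting anything.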
Posted Jul 07, 2025 - 10:50 AWST
This incident affects: Lustre filesystems (/scratch filesystem).