/scratch Object Storage Server (OSS) - Failover / reboot

Update

HPE has informed Pawsey that the recent issues with /scratch, communicated by Pawsey over the last couple of days, are caused by a Lustre bug related to the use of "fallocate". Pawsey has disabled "fallocate" on /scratch following tests performed on Setonix's Test and Development System (TDS). Workloads using fallocate may experience a minor performance hit, non-zero return codes, and files not being pre-allocated on the filesystem. Most researchers should not notice any significant change to their jobs, and given the issues we have had, some jobs may even run faster. Researchers are asked to contact Pawsey through the Help Desk if they do encounter any issues.
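
For reference, a minimal sketch (not Pawsey-provided code; the file path and allocation size are hypothetical) of how a workload that calls the Linux fallocate(2) syscall directly might treat the resulting non-zero return code as non-fatal when fallocate is disabled on the filesystem:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical path; substitute a file under /scratch. */
        int fd = open("testfile", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* Try to pre-allocate 1 GiB. With fallocate disabled on the
         * filesystem, this call may fail with errno == EOPNOTSUPP. */
        if (fallocate(fd, 0, 0, (off_t)1024 * 1024 * 1024) != 0) {
            if (errno == EOPNOTSUPP) {
                /* Not fatal: continue without pre-allocation. */
                fprintf(stderr, "fallocate unsupported; continuing without pre-allocation\n");
            } else {
                perror("fallocate");
                close(fd);
                return EXIT_FAILURE;
            }
        }

        close(fd);
        return EXIT_SUCCESS;
    }

Workloads that cannot be modified in this way and fail hard on the return code should contact the Help Desk.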
Posted Jul 09, 2025 - 15:50 AWST

Update

Storage Node Failover
* A different storage node has been identified with the same high load issues
* An HA failover will be performed on the node
* There will be a brief pause in access to /scratch
Posted Jul 09, 2025 - 09:55 AWST

Monitoring

The failover has been completed by the engineers
* High-availability resources have been restored to the original nodes
* System IO load looks nominal
* Logs and dumps are being submitted to the vendor engineers for analysis
Posted Jul 07, 2025 - 14:12 AWST

Identified

The issue has been identified and a fix is being implemented.
Posted Jul 07, 2025 - 11:25 AWST

Investigating

An Object Storage Server (OSS) is showing the same symptoms as a previously detected server: it is slow to respond.

The HPE engineers are going to fail over resources to the partner OSS so that the affected OSS can be rebooted. There will be a brief pause in access to /scratch as clients reconnect.

Please be aware that the /scratch filesystem is becoming increasingly full. We would appreciate the assistance of researchers in removing unused files from the filesystem to ensure it remains accessible to all; a sketch for identifying candidate files is shown below.
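
As an illustration only (this is not an official Pawsey tool; the 30-day threshold and the use of access time are assumptions), a short C sketch that walks a directory tree and prints regular files not accessed within a given number of days, as candidates for cleanup:

    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define DAYS_UNUSED 30  /* hypothetical threshold */

    static time_t cutoff;

    /* Report regular files whose last access is older than the cutoff. */
    static int check_entry(const char *path, const struct stat *sb,
                           int typeflag, struct FTW *ftwbuf)
    {
        (void)ftwbuf;
        if (typeflag == FTW_F && sb->st_atime < cutoff)
            printf("%s\n", path);
        return 0;  /* keep walking */
    }

    int main(int argc, char **argv)
    {
        /* Pass your /scratch directory as the first argument. */
        const char *root = (argc > 1) ? argv[1] : ".";
        cutoff = time(NULL) - (time_t)DAYS_UNUSED * 24 * 60 * 60;
        /* Walk the tree without following symlinks or crossing mounts. */
        if (nftw(root, check_entry, 16, FTW_PHYS | FTW_MOUNT) != 0) {
            perror("nftw");
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }

This only lists candidates; please review before deleting anything.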
Posted Jul 07, 2025 - 10:50 AWST
This incident affects: Lustre filesystems (/scratch filesystem).