/scratch Object Storage Server (OSS) reboot

Incident Report for Pawsey Supercomputing Research Centre

Resolved

Resources have been rebalanced to their optimal configuration. HPE have collected logs from the filesystem and will provide Pawsey with a root cause analysis in due course.

Posted Jul 04, 2025 - 08:21 AWST

Monitoring

Failover has been completed
* Resource has been restored to the nominal High Availability pair
* It will be monitored

Posted Jul 03, 2025 - 21:19 AWST

Investigating

It appears we have a slow Object Storage Server (OSS) serving part of /scratch. The HPE engineers are going to fail over its resources to the partner OSS so that they can reboot the slow OSS. There will be a brief pause in access to /scratch as clients reconnect.

Please be aware that the /scratch filesystem is becoming increasingly full. We would appreciate the assistance of researchers in removing unused files from the filesystem to ensure it is accessible to all.
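For researchers who want a starting point for finding candidates to remove, the following is a minimal Python sketch, not an official Pawsey tool. It lists regular files under a directory that have not been modified for a given number of days. The function name `find_stale_files` and the 30-day default are illustrative assumptions, not Pawsey policy, and the sketch only reports files; it deletes nothing.

```python
import os
import stat
import time


def find_stale_files(root, min_age_days=30, now=None):
    """Return (path, size_bytes) pairs for regular files under `root`
    whose modification time is older than `min_age_days` days.

    Hypothetical helper for illustration; the 30-day default is an
    assumption, not a site policy.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # 86400 seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so links are never reported
                st = os.lstat(path)
            except OSError:
                # File vanished or is unreadable; skip it
                continue
            if stat.S_ISREG(st.st_mode) and st.st_mtime < cutoff:
                stale.append((path, st.st_size))
    return stale


def total_size(entries):
    """Sum the sizes (in bytes) of the entries returned by find_stale_files."""
    return sum(size for _path, size in entries)
```

To use it, call `find_stale_files` on your own project directory under /scratch and review the returned list before removing anything. Walking a very large directory tree generates many metadata requests, so it is better to scan one project directory at a time than the whole filesystem.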

Posted Jul 03, 2025 - 14:28 AWST

This incident affected: Lustre filesystems (/scratch filesystem).