Setonix Scratch - Degraded Performance

Incident Report for Pawsey Supercomputing Research Centre

Resolved

The filesystem is operating in a nominal configuration. HPE's recommended "fix" is to upgrade the firmware on the filesystem. We are still waiting for a root cause analysis.

Posted May 16, 2023 - 11:49 AWST

Monitoring

We are still waiting for a root cause for the failure, but /scratch has been operating normally over the last week. We have been advised that the "fix" is an upgrade to the latest firmware version. We will continue to monitor the filesystem in the meantime.

Posted May 15, 2023 - 08:00 AWST

Update

HPE suspect that the high load may have caused the OSSes to be set to INACTIVE on the MDS. They have reactivated the OSSes on the MDS, which appears to have resolved researchers' issues with accessing /scratch.

Posted May 08, 2023 - 11:59 AWST

Update

The issue with /scratch seems to be heavily impacting the Setonix compute nodes. Health checks are taking nodes offline, which is reducing compute node availability.

Posted May 07, 2023 - 15:12 AWST

Update

HPE are going to "crash" scratch1n47 to collect a core dump for analysis. There will be a brief interruption to /scratch while this happens.

Posted May 04, 2023 - 11:47 AWST

Investigating

We have noticed a high load on one of the Object Storage Servers (OSSes) that provide /scratch on Setonix.

A critical support call was lodged with HPE on 20 April 2023. We have received no response from the vendor.

Posted May 04, 2023 - 08:48 AWST

This incident affected: Lustre filesystems (/scratch filesystem) and Setonix (Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition).