The filesystem is operating in a nominal configuration. HPE's recommended "fix" is to upgrade the filesystem firmware, and we are still waiting for a root cause analysis.
Posted May 16, 2023 - 11:49 AWST
We are still waiting for a root cause analysis of the failure, but /scratch has been operating normally over the last week. We have been advised that the "fix" is to upgrade to the latest firmware version, but we will continue to monitor the filesystem.
Posted May 15, 2023 - 08:00 AWST
HPE suspect that the high load may have set the OSSes to INACTIVE on the MDS. They have reactivated the OSSes on the MDS, which appears to have resolved researchers' issues with accessing /scratch.
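For context, on Lustre systems this kind of reactivation is typically performed on the MDS with `lctl`, which can list the MDS's connections to the storage targets and reactivate any marked INACTIVE. A hedged sketch only; the device numbers and target names below are illustrative, not taken from this incident:

```shell
# On the MDS: list device connections to the OSTs and look for inactive ones
lctl dl | grep osc

# Check which OST connections the MDS currently considers active (1 = active)
lctl get_param osc.*.active

# Reactivate an inactive OST connection by its device number
# (e.g. device 12 from the 'lctl dl' listing above)
lctl --device 12 activate
```

Strictly speaking, it is the MDS's connections to the Object Storage Targets (OSTs) served by an OSS that get marked INACTIVE; reactivating them restores the MDS's ability to allocate and access objects on that server.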
Posted May 08, 2023 - 11:59 AWST
The issue with /scratch is heavily impacting the Setonix compute nodes. Health checks are taking nodes offline, reducing compute node availability.
Posted May 07, 2023 - 15:12 AWST
HPE are going to "crash" scratch1n47 to collect a core dump for analysis. There will be a brief interruption to /scratch while this happens.
Posted May 04, 2023 - 11:47 AWST
We have noticed high load on one of the Object Storage Servers (OSSes) that provide /scratch on Setonix.
A critical support call was lodged with HPE on 20 April 2023. We have received no response from the vendor.
Posted May 04, 2023 - 08:48 AWST
This incident affected: Lustre filesystems (/scratch filesystem (new)) and Setonix (Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition).