Scratch metadata server offline

Resolved

We have seen no further issues with the metadata servers. We are still awaiting a root-cause analysis from HPE.
Posted Sep 26, 2023 - 15:26 AWST

Monitoring

Both metadata targets have been remounted by HPE engineers, who are monitoring the system. The root cause of the issue is still under investigation.
Posted Sep 25, 2023 - 14:25 AWST

Update

Today is a public holiday in Western Australia; however, we are still monitoring this incident and awaiting an update from our vendor. We are aware that over 600 jobs in the Slurm queue are stuck in the 'Completing' state, presumably because they were unable to finalise file I/O before exiting.
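As a quick check (a sketch only, assuming the standard Slurm client tools are available on the Setonix login nodes), users can list their own jobs stuck in this state with:

    squeue -u $USER -t CG

where CG is Slurm's short code for the COMPLETING state.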
Posted Sep 25, 2023 - 05:10 AWST

Update

Both metadata servers have been fenced (STONITHed). A critical case has been lodged with HPE.
Posted Sep 22, 2023 - 17:34 AWST

Identified

Both metadata servers have booted. The metadata targets have been mounted and are currently in recovery mode.
Posted Sep 22, 2023 - 16:17 AWST

Investigating

We are currently investigating this issue.
Posted Sep 22, 2023 - 13:21 AWST
This incident affected: Lustre filesystems (/scratch filesystem) and Setonix (Login nodes, Data-mover nodes, Slurm scheduler, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition, Setonix gpu partition, Setonix gpu high mem partition, Setonix gpu debug partition).