/scratch failed after NEO 6.6-021 update
Monitoring
Just before Pawsey was going to return Setonix to service yesterday, HPE performed what should have been a minor hardware replacement on one of the Lustre servers. This caused the Lustre server to crash, and the onsite HPE engineer spent over twenty-four hours resolving the issue.

HPE have advised Pawsey that the hardware issue is now resolved, and with this assurance Pawsey has removed the login restrictions on Setonix and Garrawarla.

Pawsey will monitor both systems to observe the impact of HPE's advised workaround. If you have any issues, please email help@pawsey.org.au.
Posted Mar 12, 2024 - 11:57 AWST
Identified
HPE provided Pawsey with a potential workaround for the instability issues experienced by the /scratch filesystem at 11:30 AM on Friday (8 March 2024). Pawsey staff spent all of Friday afternoon rebuilding the compute, login and data-mover node images and rebooted Setonix to apply the workaround consistently.

Testing performed over the weekend highlighted that the data-mover node images need to be rebuilt.

The filesystem is being monitored and has so far stayed up.
Posted Mar 11, 2024 - 08:58 AWST
Investigating
All four metadata servers are offline. HPE are investigating.
Posted Mar 07, 2024 - 08:11 AWST
This incident affects: Garrawarla (Garrawarla workq partition, Garrawarla gpuq partition, Garrawarla asvoq partition, Garrawarla copyq partition, Garrawarla login node, Slurm Controller (Garrawarla)) and Setonix (Login nodes, Data-mover nodes, Slurm scheduler, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition, Setonix gpu partition, Setonix gpu high mem partition, Setonix gpu debug partition).