/scratch failed after NEO 6.6-021 update
Monitoring
Just before Pawsey was going to return Setonix to service yesterday, HPE performed what should have been a minor hardware replacement on one of the Lustre servers. This caused the Lustre server to crash, and the onsite HPE engineer spent over twenty-four hours resolving the issue.

HPE have advised Pawsey that the hardware issue is now resolved, and with this assurance Pawsey has removed the login restrictions on Setonix and Garrawarla.

Pawsey will monitor both systems to observe the impact of HPE's advised workaround. If you have any issues, please email help@pawsey.org.au.
Posted Mar 12, 2024 - 11:57 AWST
Identified
HPE provided Pawsey with a potential workaround for the instability issues experienced by the /scratch filesystem at 11:30 AM on Friday (8 March 2024). Pawsey staff spent all of Friday afternoon rebuilding the compute, login and data-mover node images and rebooted Setonix to apply the workaround consistently.

Testing performed over the weekend highlighted that the data-mover node images need to be rebuilt.

The filesystem is being monitored and has so far stayed up.
Posted Mar 11, 2024 - 08:58 AWST
Investigating
All four metadata servers are offline. HPE are investigating.
Posted Mar 07, 2024 - 08:11 AWST
This incident affects: Garrawarla (Garrawarla workq partition, Garrawarla gpuq partition, Garrawarla asvoq partition, Garrawarla copyq partition, Garrawarla login node, Slurm Controller (Garrawarla)) and Setonix (Login nodes, Data-mover nodes, Slurm scheduler, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition, Setonix gpu partition, Setonix gpu high mem partition, Setonix gpu debug partition).