/scratch issues
Resolved
Upgraded to NEO 6.6-021 and the flock mount option has been reinstated.
Posted Mar 14, 2024 - 12:36 AWST
Update
We are continuing to monitor for any further issues.
Posted Feb 21, 2024 - 09:18 AWST
Update
setonix-03 has been resurrected and placed back into the round robin.
Posted Jan 03, 2024 - 12:08 AWST
Update
HPE have recommended that /software and /scratch be mounted using localflock rather than flock. We have implemented this change across Setonix and Garrawarla, and our internal testing has passed.
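
For context, this is a client-side Lustre mount option change; a minimal sketch of the two variants, using a hypothetical MGS NID and mount point, might look like:

    # Previous behaviour: flock locks are coherent across all client nodes (MGS NID is hypothetical)
    mount -t lustre -o flock 10.0.0.1@tcp:/scratch /scratch

    # Recommended change: flock locks are only enforced within a single client node
    mount -t lustre -o localflock 10.0.0.1@tcp:/scratch /scratch

With localflock, flock() calls still succeed, but locks are not visible across nodes, so multi-node workloads that coordinate through file locks may need adjusting.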

We will monitor the systems, but if you run into any issues with file locking, please reach out to help@pawsey.org.au.

Please note that setonix-03 is currently unavailable due to an internal configuration issue, and we are removing it from the round robin DNS.
Posted Jan 02, 2024 - 13:41 AWST
Update
Once again the same two servers have powered off. A reservation has been placed across Setonix. We are waiting to hear back from HPE R&D.
Posted Dec 22, 2023 - 16:09 AWST
Update
The vendor has restarted the two Lustre servers again this afternoon.
Posted Dec 22, 2023 - 14:58 AWST
Update
The vendor has restarted the two servers this morning, and the setonix-01 login node is responsive again. The Slurm partitions remain drained until health checks on the compute nodes have completed.
Posted Dec 22, 2023 - 06:58 AWST
Update
Once again, the same two servers have powered off overnight. This has been escalated to the vendor.
Posted Dec 22, 2023 - 05:26 AWST
Monitoring
HPE have restored the two Lustre metadata servers. Pawsey has turned off quota enforcement as a precaution and has removed the reservation.
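
For context, quota enforcement on a Lustre filesystem is toggled from the MGS; a minimal sketch, assuming the filesystem is named scratch, might look like:

    # Run on the MGS: disable quota enforcement on the OSTs and MDTs (filesystem name assumed)
    lctl conf_param scratch.quota.ost=none
    lctl conf_param scratch.quota.mdt=none

Usage accounting continues in the background, so enforcement can later be re-enabled (e.g. quota.ost=ug) without a rescan.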

We will observe the system over the weekend, but are unlikely to get a root cause analysis from HPE until Tuesday.
Posted Dec 17, 2023 - 09:29 AWST
Identified
A reservation has been put in place across ALL nodes to prevent new jobs from starting until /scratch is restored.
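
A system-wide maintenance reservation of this kind is typically created with scontrol; a minimal sketch, with a hypothetical reservation name, might look like:

    # Hypothetical example: hold all nodes under a maintenance reservation so no new jobs start
    scontrol create reservation ReservationName=scratch_outage \
        StartTime=now Duration=UNLIMITED Nodes=ALL Flags=MAINT Users=root

Running jobs are unaffected; the reservation only blocks new jobs from being scheduled onto the reserved nodes until it is removed with scontrol delete.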
Posted Dec 16, 2023 - 16:20 AWST
Investigating
Both nodes that are capable of housing the Lustre Management Server (MGS) for /scratch are powered off. The net result is that /scratch is unavailable to users. A case has been logged with our vendor; however, it is unlikely to be resolved over the weekend.
Posted Dec 16, 2023 - 15:57 AWST
This incident affected: Garrawarla (Garrawarla workq partition, Garrawarla gpuq partition, Garrawarla asvoq partition, Garrawarla copyq partition, Garrawarla login node), Lustre filesystems (/scratch filesystem (new)), and Setonix (Login nodes, Data-mover nodes, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition, Setonix gpu partition, Setonix gpu high mem partition, Setonix gpu debug partition).