Issues with /astro metadata servers
Incident Report for Pawsey Supercomputing Centre
Resolved
No further issues were seen with the two MDSs throughout the day.
Posted Jan 12, 2021 - 05:26 AWST
Monitoring
Both MDSs seem to be performing normally following a reboot.
Posted Jan 11, 2021 - 07:01 AWST
Identified
The pair of servers that make up the metadata service for /astro (running the MGS and both MDTs) isn't handling failover correctly. One server rebooted earlier this morning, but its disk LUNs aren't being correctly handed off to the partner. Fixing this will require a reboot of one, possibly both, of the servers.
During this time access to /astro will be impacted.
Posted Jan 11, 2021 - 04:14 AWST
This incident affected: Magnus (Magnus Compute nodes), Garrawarla (Garrawarla compute nodes), Zeus (Zeus Compute nodes, Data Mover nodes (CopyQ)), and Lustre filesystems (/astro filesystem).