Setonix Administrative Control Plane Unresponsive

Resolved

There have been no further issues with the control plane overnight. We are currently writing a post-incident review (PIR) to determine if there are ways to minimise the impact in the future.
Posted May 21, 2025 - 06:25 AWST

Monitoring

HPE are collecting logs, but Setonix appears to have soldiered on regardless.
Posted May 20, 2025 - 20:14 AWST

Investigating

Setonix's administrative control plane is unresponsive. Setonix continues to run jobs, but node name resolution is compromised.

A Critical Case has been raised with HPE.
Posted May 20, 2025 - 19:07 AWST
This incident affected: Setonix (Login nodes, Data-mover nodes, Slurm scheduler, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition, Setonix gpu partition, Setonix gpu high mem partition, Setonix gpu debug partition).