Cooling Unit (CDU) 1 has failed in Setonix

Resolved

CDU1 has survived overnight.
Posted Sep 04, 2025 - 11:42 AWST

Update

CDU1 failed again this morning.

HPE discovered that the expansion tank had no pressure. They have pumped the bladder tank up to 18 PSI.

All cabinets, chassis and nodes are now powered up and nodes are booted.

We will pray.
Posted Sep 03, 2025 - 11:37 AWST

Monitoring

HPE have powered the affected racks back on, and all nodes bar one have passed testing.

Pawsey will ask for a Post Incident Report from HPE and monitor the system overnight.
Posted Sep 02, 2025 - 17:50 AWST

Investigating

CDU 1 has failed in Setonix, which has taken racks X1002, X1003 and X1004 offline, along with all nodes in those racks.

Approximately half of the GPU capacity and one third of the CPU capacity of Setonix are unavailable.

HPE are onsite and investigating, but have no ETA on when the cooling unit will be restored to service.
Posted Sep 02, 2025 - 17:10 AWST
This incident affected: Setonix (Setonix work partition, Setonix gpu partition, Setonix gpu high mem partition, Setonix gpu debug partition).