Setonix Compute Node partial outage part II

Incident Report for Pawsey Supercomputing Research Centre

Resolved

This incident has been resolved.

Posted Nov 21, 2023 - 14:38 AWST

Monitoring

The cabinets in Setonix that were shutting down have now passed our stress tests and functional tests and have been returned to service. HPE and Pawsey will monitor how they perform overnight. The other cabinets that run jobs in the work partition have been running fine during this incident and have worked through most of the queued jobs in that partition, so there are free nodes now. When researchers submit more jobs to the queue, we will be able to see how the serviced cabinets run.

Posted Nov 14, 2023 - 15:28 AWST

Update

We are continuing to work on a fix for this issue.

Posted Nov 14, 2023 - 14:21 AWST

Update

HPE have identified and carried out work needed on the internal cooling system of the affected cabinets to address the issues. The nodes in these cabinets have been brought up, and Pawsey are currently running stress tests on them to gain some confidence that the coolant distribution unit isn't going to shut the cabinets down again. During this time you may see many jobs in the Slurm queue from the user moshea. These jobs will all be terminated within the hour. Assuming this test is successful, we will run our usual functional tests and, if all is well, return the missing nodes to service. We expect this to be within an hour or so.

Posted Nov 14, 2023 - 14:21 AWST

Identified

HPE have been advised to install new filters in their coolant distribution units and clean all sensors, and have completed this work. They will monitor overnight and, if all is well in the morning, Pawsey will put a load on the nodes in the affected cabinets to see how the cooling system copes.

Posted Nov 13, 2023 - 16:01 AWST

Investigating

The same two cabinets that powered off over the weekend have again performed an emergency power-off. Our vendor is investigating the cause, and we'll likely keep the cabinets out of service until the vendor can explain what is causing this. The rest of the compute nodes are still up and running jobs.

Posted Nov 13, 2023 - 11:22 AWST

This incident affected: Setonix (Setonix work partition).