Setonix nodes experiencing Lustre errors

Incident Report for Pawsey Supercomputing Research Centre

Resolved

Pawsey has completed its node testing (STREAM and HPL) and application testing (ReFrame).

We have returned Setonix to service.

Posted Mar 20, 2023 - 14:04 AWST

Update

HPE have handed Phase 1 back to Pawsey after checking for flapping links, and running dgnettest, MPI all-to-all and IOR tests.

Posted Mar 20, 2023 - 12:40 AWST

Update

HPE have converted all of Phase 1 to use a TCP configuration. They are currently waiting on Slingshot diagnostics before handing Phase 1 back to Pawsey for testing.

Posted Mar 20, 2023 - 08:38 AWST

Identified

HPE have taken control of Setonix. Engineering in the US have advised converting Lustre from SoftRoCE to TCP, which is a supported configuration.

Posted Mar 18, 2023 - 07:26 AWST

Update

Approximately 50% of compute nodes are experiencing Lustre errors, and Pawsey has proactively placed a reservation on the system to allow it to drain.

We are discussing a resolution to the issue with HPE.

Posted Mar 17, 2023 - 08:27 AWST

Investigating

A significant number of nodes in Setonix are experiencing Lustre errors. A support case was raised this morning with HPE at a critical level, and they are investigating the issue.

At this stage we don't know the cause of the issue.

Posted Mar 16, 2023 - 17:48 AWST

This incident affected: Setonix (Login nodes, Data-mover nodes, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition).