Setonix nodes experiencing Lustre errors
Resolved
Pawsey has completed its testing of nodes (STREAM and HPL) and application testing (ReFrame).

We have returned Setonix to service.
Posted Mar 20, 2023 - 14:04 AWST
Update
HPE have handed Phase 1 back to Pawsey after checking for flapping links, and running dgnettest, MPI all-to-all and IOR tests.
Posted Mar 20, 2023 - 12:40 AWST
Update
HPE have converted all of Phase 1 to use a TCP configuration. They are currently waiting on Slingshot diagnostics before handing Phase 1 back to Pawsey for testing.
Posted Mar 20, 2023 - 08:38 AWST
Identified
HPE have taken control of Setonix. Engineering in the US have advised converting Lustre from SoftRoCE to TCP, which is a supported configuration.
Posted Mar 18, 2023 - 07:26 AWST
Update
Approximately 50% of compute nodes are experiencing Lustre errors, and Pawsey has proactively placed a reservation on the system to allow it to drain.

We are discussing a resolution to the issue with HPE.
Posted Mar 17, 2023 - 08:27 AWST
Investigating
A significant number of nodes in Setonix are experiencing Lustre errors. A support case was raised this morning with HPE at a critical level, and they are investigating the issue.

At this stage we don't know the cause of the issue.
Posted Mar 16, 2023 - 17:48 AWST
This incident affected: Setonix (Login nodes, Data-mover nodes, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition).