Pawsey has completed its node testing (STREAM and HPL) and application testing (ReFrame).
We have returned Setonix to service.
Posted Mar 20, 2023 - 14:04 AWST
Update
HPE have handed Phase 1 back to Pawsey after checking for flapping links, and running dgnettest, MPI all-to-all and IOR tests.
Posted Mar 20, 2023 - 12:40 AWST
Update
HPE have converted all of Phase 1 to use a TCP configuration. They are currently waiting on Slingshot diagnostics before handing Phase 1 back to Pawsey for testing.
Posted Mar 20, 2023 - 08:38 AWST
Identified
HPE have taken control of Setonix. Engineering in the US have advised converting Lustre from SoftRoCE to TCP, which is a supported configuration.
Posted Mar 18, 2023 - 07:26 AWST
Update
Approximately 50% of compute nodes are experiencing Lustre errors, and Pawsey has proactively placed a reservation on the system to allow it to drain.
We are discussing a resolution to the issue with HPE.
Posted Mar 17, 2023 - 08:27 AWST
Investigating
A significant number of nodes in Setonix are experiencing Lustre errors. A support case was raised this morning with HPE at a critical level, and they are investigating the issue.
At this stage we don't know the cause of the issue.
Posted Mar 16, 2023 - 17:48 AWST
This incident affected: Setonix (Login nodes, Data-mover nodes, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition).