Resolved -
HPE believe the issue has been resolved and have closed the support case.
In their words, the issue was:
"Global Flow Control was disabled on the E1000's. Once enabled performance was regained. ClusterStor team working on a fix (CSPROD-18819) to make Global Flow Control enabled all the time moving forward."
Jan 27, 10:06 AWST
Update -
HPE made changes to Global Flow Control on the scratch and software filesystems during January's maintenance, as well as a modification to the configuration of the LAG ports in the Slingshot fabric.
We haven't seen any C_EC_CRIT errors on the login nodes since the maintenance, and are continuing to monitor them like hawks.
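If you want to check a node yourself, here is a rough sketch (not our actual monitoring, and assuming these errors surface in the kernel log as lines containing "C_EC_CRIT") that counts how many have been logged:

```python
# Rough sketch: count kernel-log lines mentioning C_EC_CRIT on this node.
# Assumes the Slingshot NIC errors appear in the kernel ring buffer and that
# you have permission to read the kernel journal.
import subprocess

def count_c_ec_crit() -> int:
    """Return the number of kernel-log lines containing C_EC_CRIT."""
    out = subprocess.run(
        ["journalctl", "--kernel", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in out.splitlines() if "C_EC_CRIT" in line)

if __name__ == "__main__":
    print(f"C_EC_CRIT events in kernel log: {count_c_ec_crit()}")
```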
Jan 19, 08:21 AWST
Update -
setonix-04 reported a C_EC_CRIT error yesterday. It is not in the login pool, and HPE are still stumped as to why this is happening.
Nov 7, 13:40 AWST
Monitoring -
HPE rebooted a number of Slingshot switches during the maintenance.
We haven't observed any Slingshot errors on the login, data-mover, or visualisation nodes for 48 hours.
We will continue to monitor.
Nov 6, 12:29 AWST
Update -
HPE have provided no new information.
Oct 31, 08:11 AWST
Update -
HPE have provided no new information.
Oct 27, 10:59 AWST
Update -
HPE have provided no new information.
Oct 24, 21:01 AWST
Update -
HPE have provided no new information.
setonix-08 has Slingshot issues. Pawsey is rebooting it.
Oct 20, 13:25 AWST
Update -
setonix-02 and setonix-03 have been added back to the round-robin (RR) DNS.
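If you're curious which login nodes you may land on, a rough sketch (assuming setonix.pawsey.org.au is the round-robin login alias) to list the addresses it currently resolves to:

```python
# Rough sketch: list the addresses currently returned for the round-robin
# login alias (assumed here to be setonix.pawsey.org.au).
import socket

addresses = sorted({
    info[4][0]
    for info in socket.getaddrinfo("setonix.pawsey.org.au", 22,
                                   proto=socket.IPPROTO_TCP)
})
for ip in addresses:
    print(ip)
```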
Oct 16, 14:09 AWST
Investigating -
There appears to be an issue with the Slingshot interfaces on the Setonix login nodes. We appear to be down to one login node in the normal pool of login nodes.
We have had a case open with HPE for weeks, but they appear to be no closer to providing any kind of solution.
Please, please, please, please don't run any computationally intensive operations on the login nodes. We have lovely compute nodes for that.
Please be aware that you can log into setonix-workflow.pawsey.org.au and get access to additional "workflow" nodes.
Oct 16, 12:02 AWST