Nimbus authentication services offline
Incident Report for Pawsey Supercomputing Centre
Postmortem

A script that controls our automation suite had insufficient error checking and was called with incorrect parameters. As a result, an incorrect configuration was propagated to a group of machines, which caused this incident.

The script was rerun with the correct parameters, which rectified the cluster issues. Error checking in the script has since been improved, and additional checks and monitoring have been put in place as mitigations.

Posted Oct 20, 2020 - 09:24 AWST

Resolved
After some additional fixes to the message queue, the system has been stable since 11:30 AWST.
Posted Oct 07, 2020 - 15:26 AWST
Monitoring
A problem with the message queue system was stopping authentication messages from reaching the authentication layer. This has been addressed, and we are monitoring the system for any related errors while the system stabilises.
Posted Oct 07, 2020 - 10:17 AWST
Investigating
Users are currently unable to log in to the Nimbus dashboard. We are investigating the issue.
Posted Oct 07, 2020 - 09:00 AWST
This incident affected: Nimbus (Nimbus dashboard).