Nimbus dashboard and authentication endpoint unavailable

Incident Report for Pawsey Supercomputing Research Centre

Resolved

Due to a combination of factors, the failover service dropped the IPs and monitoring did not catch the interruption. The failover services did not successfully communicate with one another during a network restart, and the transitional state of the final stage of our control plane migration compounded the problem. The issues are now resolved, and we will continue working to improve the resilience of the services.

Posted Mar 09, 2023 - 17:13 AWST

Monitoring

The main IP address became unresponsive. Restarting the failover service reinstated the IP, and all systems are now operating normally.

Posted Mar 09, 2023 - 15:49 AWST

Investigating

The Nimbus dashboard and API are currently experiencing an outage. Access to Nimbus instances themselves is not affected. We are investigating the issue.

Posted Mar 09, 2023 - 15:16 AWST

This incident affected: Nimbus (Nimbus dashboard).