Cooling issue in Pawsey Centre
Resolved
This incident has been resolved.
Posted Dec 22, 2023 - 09:13 AWST
Update
The data portal has been returned to service.

We are continuing to monitor all systems for any further issues.
Posted Dec 14, 2023 - 09:34 AWST
Update
Garrawarla has been returned to service. We are still working on the Data Portal restoration.
Posted Dec 13, 2023 - 11:00 AWST
Update
Remote visualisation services are back up now. We are still working on the Data Portal restoration.
Posted Dec 12, 2023 - 13:40 AWST
Update
We are continuing to monitor for any further issues.
Posted Dec 12, 2023 - 13:13 AWST
Monitoring
Setonix has been returned to service. HPE have successfully upgraded the /scratch and /software filesystems to the latest supported NEO release.

The SLURM update has been postponed.

If you have any issues, please contact help@pawsey.org.au. Be kind.
Posted Dec 12, 2023 - 12:54 AWST
Update
We are continuing to work on a fix for this issue.
Posted Dec 11, 2023 - 12:30 AWST
Update
HPE will be working through the weekend to return Setonix and Garrawarla to service next week. The current forecast date for services to return is Tuesday 12 December.

Data portal services remain offline. We are working to restore these services and are waiting on vendor repair or replacement of equipment damaged during the EPO event.

If you have any questions or find issues with the systems when they are returned, please kindly reach out to help@pawsey.org.au and we will help out where we can.
Posted Dec 08, 2023 - 12:25 AWST
Update
We are continuing to work on a fix for this issue.
Posted Dec 08, 2023 - 11:34 AWST
Update
Acacia (Ingest) has passed final testing and has been returned to service.

Nimbus has passed final testing and has been returned to service (including Fgenesh++).
Posted Dec 06, 2023 - 14:33 AWST
Update
Acacia (Projects) has passed final testing and has been returned to service.

Acacia (Ingest) is currently in final testing and should be returned to service soon.

The Nimbus control plane is operational, allowing compute nodes to be booted and tested. We expect Nimbus to be returned to service this afternoon.

Banksia is currently online and in final testing. The Data Portal system will be brought online and tested soon.

The ASKAP ingest nodes will be handed to ASKAP soon for them to test. ASKAP Buffer has been upgraded to Lustre 2.12.9.

HPE is still collecting diagnostics from Setonix. We have no ETA on when it will be returned to service.

If you have any questions or find issues with the systems when they are returned, please kindly reach out to help@pawsey.org.au and we will help out where we can.
Posted Dec 06, 2023 - 13:44 AWST
Update
The firewall has been returned to service. JIRA, Confluence and Pawsey e-mail have been restored.

HPE are onsite performing hardware checks on Setonix.

Please be patient as we try to return services as quickly as possible, and be courteous when contacting the Helpdesk.
Posted Dec 05, 2023 - 11:50 AWST
Update
Power was restored to the Supercomputing Cell yesterday and preliminary checks have been positive.

We have identified an issue with the firewall which is currently blocking progress. We lodged a support call with the vendor yesterday afternoon, but are yet to hear from them.
Posted Dec 05, 2023 - 09:58 AWST
Identified
Cooling was restored Saturday, and the I/O Cell was re-energised on Sunday.

We are currently inspecting hardware to determine the impact of the incident before we begin restoring networking and core services.
Posted Dec 04, 2023 - 07:50 AWST
Update
ALL Pawsey systems are currently powered down while we await the all-clear from our facilities team. This includes mail and helpdesk services. Please note that even when we receive the OK to start services again, it may be some considerable time before all resources are fully available to users.
Posted Dec 02, 2023 - 09:54 AWST
Update
Banksia disk arrays have been impacted by the cooling issue and Banksia has been shut down.
Posted Dec 02, 2023 - 07:36 AWST
Update
We are continuing to investigate this issue.
Posted Dec 02, 2023 - 07:26 AWST
Update
Networking is impacted by the cooling issues and as such most services are at risk.
Posted Dec 02, 2023 - 07:13 AWST
Investigating
Overnight alerts indicate there is an issue with the cooling system within the Pawsey Centre.

Several of the disk arrays that provide storage to the /astro and /askapbuffer filesystems are showing overtemperature alarms, as are sensors mounted in the core network equipment. The full extent of the impact is not yet known, as some systems (notably Setonix) use a different primary cooling loop to others.
Posted Dec 02, 2023 - 04:44 AWST
This incident affected: Central Services (Authentication and Authorization, Service Desk, License Server, Application Portal, Origin, /home filesystem, /pawsey filesystem, Central Slurm Database, Nebula, Documentation), Nimbus (Ceph storage, Nimbus instances, Nimbus dashboard, Nimbus APIs), Garrawarla (Garrawarla workq partition, Garrawarla gpuq partition, Garrawarla asvoq partition, Garrawarla copyq partition, Garrawarla login node, Slurm Controller (Garrawarla)), The Australian Biocommons (Fgenesh++), Storage Systems (Acacia - Projects, Banksia, Data Portal Systems, MWA Nodes, CASDA Nodes, Acacia - Ingest, MWA ASVO), Lustre filesystems (/scratch filesystem (new), /software filesystem, /askapbuffer filesystem, /askapingest filesystem), Setonix (Login nodes, Data-mover nodes, Slurm scheduler, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition, Setonix gpu partition, Setonix gpu high mem partition, Setonix gpu debug partition), and Visualisation Services (Setonix vis nodes, Nebula vis nodes, Visualisation Lab).