Pawsey Supercomputing Research Centre
Update - This complex issue with Lib 2 has been escalated to Spectralogic management to obtain onsite support, so that the library can be fully diagnosed, repaired with parts and updates, and brought back online for second-copy use. The service remains degraded and at risk.
Apr 26, 2024 - 11:19 AWST
Update - We are continuing to investigate this issue.
Apr 22, 2024 - 08:24 AWST
Update - We are identifying why the robotics on Lib 2 have errored with Spectralogic Support.
Apr 22, 2024 - 08:24 AWST
Investigating - The Banksia service is in a degraded/"at risk" state because it is operating on only one tape library rather than the default two. This means that the redundant second copy of files will be unavailable for staging or archiving until Library 2 is returned to service.
Apr 22, 2024 - 08:21 AWST
Monitoring - Just before Pawsey was going to return Setonix to service yesterday, HPE performed what should have been a minor hardware replacement on one of the Lustre servers. This resulted in one of the Lustre servers crashing, and the onsite HPE engineer spent over twenty-four hours resolving the issue.

HPE have advised Pawsey the hardware issue is now resolved, and with this assurance Pawsey has removed login restrictions on Setonix and Garrawarla.

Pawsey will monitor both systems to observe the impact of HPE's advised workaround. If you have any issues, please kindly e-mail help@pawsey.org.au.

Mar 12, 2024 - 11:57 AWST
Identified - HPE provided Pawsey a potential workaround for the instability issues experienced by the Scratch filesystem at 11:30 AM on Friday (8th March 2024). Pawsey staff spent all of Friday afternoon rebuilding compute, login and data mover node images and rebooted Setonix to consistently apply the workaround.

Testing performed over the weekend highlighted that the data mover node images need to be rebuilt.

The filesystem is being monitored and has so far stayed up.

Mar 11, 2024 - 08:58 AWST
Investigating - All four metadata servers are offline. HPE are investigating.
Mar 07, 2024 - 08:11 AWST
Setonix Operational (94.27 % uptime over the past 90 days)
Login nodes Operational
Data-mover nodes Operational
Slurm scheduler Operational
Setonix work partition Operational
Setonix debug partition Operational
Setonix long partition Operational
Setonix copy partition Operational
Setonix askaprt partition Operational
Setonix highmem partition Operational
Setonix gpu partition Operational (94.27 % uptime over the past 90 days)
Setonix gpu high mem partition Operational (94.27 % uptime over the past 90 days)
Setonix gpu debug partition Operational (94.27 % uptime over the past 90 days)
Lustre filesystems Operational (94.03 % uptime over the past 90 days)
/scratch filesystem (new) Operational (76.12 % uptime over the past 90 days)
/software filesystem Operational (100.0 % uptime over the past 90 days)
/askapbuffer filesystem Operational (99.99 % uptime over the past 90 days)
/askapingest filesystem Operational (100.0 % uptime over the past 90 days)
Storage Systems Operational (99.72 % uptime over the past 90 days)
Acacia - Projects Operational
Banksia Operational
Data Portal Systems Operational
MWA Nodes Operational
CASDA Nodes Operational
Acacia - Ingest Operational
MWA ASVO Operational (99.72 % uptime over the past 90 days)
ASKAP Operational
ASKAP ingest nodes Operational
ASKAP service nodes Operational
Garrawarla Operational
Garrawarla workq partition Operational
Garrawarla gpuq partition Operational
Garrawarla asvoq partition Operational
Garrawarla copyq partition Operational
Garrawarla login node Operational
Slurm Controller (Garrawarla) Operational
Nimbus Operational
Ceph storage Operational
Nimbus instances Operational
Nimbus dashboard Operational
Nimbus APIs Operational
Central Services Operational (100.0 % uptime over the past 90 days)
Authentication and Authorization Operational
Service Desk Operational
License Server Operational
Application Portal Operational
Origin Operational
/home filesystem Operational
/pawsey filesystem Operational
Central Slurm Database Operational (100.0 % uptime over the past 90 days)
Nebula Operational (100.0 % uptime over the past 90 days)
Documentation Operational (100.0 % uptime over the past 90 days)
Visualisation Services Under Maintenance (100.0 % uptime over the past 90 days)
Remote Vis Operational
Vis scheduler Operational
Setonix vis nodes Operational
Nebula vis nodes Operational
Visualisation Lab Operational
Reservation Operational
CARTA - Stable Operational
CARTA - Test Under Maintenance (100.0 % uptime over the past 90 days)
The Australian Biocommons Operational
Fgenesh++ Operational
Past Incidents
Apr 27, 2024

No incidents reported today.

Apr 26, 2024
Resolved - Scratch appears to be nominal. We are waiting on an RCA (root cause analysis) from HPE.
Apr 26, 08:18 AWST
Monitoring - The two OSTs which refused to reassemble (Avengers-style) have been put back together and have had a filesystem check performed on them.

Scratch is now fully functional, and staff will resume drained nodes on Setonix on Friday (26th April 2024).

Apr 25, 11:05 AWST
Identified - HPE are onsite and have replaced failed hardware. They are currently bringing the OSTs back and will perform a filesystem check when they are ready.
Apr 24, 19:26 AWST
Investigating - HPE are working on the issue.
Apr 24, 15:27 AWST
Apr 25, 2024
Apr 24, 2024
Apr 23, 2024

No incidents reported.

Apr 22, 2024
Apr 21, 2024

No incidents reported.

Apr 20, 2024

No incidents reported.

Apr 19, 2024

No incidents reported.

Apr 18, 2024

No incidents reported.

Apr 17, 2024

No incidents reported.

Apr 16, 2024

No incidents reported.

Apr 15, 2024
Resolved - No recurrence of the event.
Apr 15, 07:53 AWST
Monitoring - The OST has been failed over to its partner pair and the filesystem has recovered.
Apr 12, 09:39 AWST
Investigating - We are investigating.
Apr 12, 09:29 AWST
Apr 14, 2024

No incidents reported.

Apr 13, 2024

No incidents reported.