Pawsey Supercomputing Centre
Update - Our administrators are attempting to free space on the critically full OSTs by migrating some large files to less full OSTs.
May 26, 15:53 AWST
Investigating - Garrawarla has begun draining nodes in response to the /astro filesystem nearing critical OST usage.
May 26, 15:12 AWST
Identified - One of the arrays in /askapbuffer (askapfs-array06) restarted one of its controller cards on Friday. This has been escalated to the vendor, who are investigating the error. A replacement for a disk that failed on Sunday has not yet been shipped, as the vendor is awaiting a response from their escalation team.
Whilst there has been no visible outage to users, we are raising this incident in case we need to perform maintenance beyond normal hot-swappable parts.

May 16, 16:10 AWST
Magnus Operational
Magnus Compute nodes Operational
Magnus login nodes Operational
Slurm Controller (Magnus) Operational
Galaxy Operational
Galaxy Compute nodes Operational
Galaxy login nodes Operational
Slurm Controller (Galaxy) Operational
Topaz Operational
GPU partition Operational
Topaz login nodes Operational
Slurm Controller (Topaz) Operational
Zeus Operational
Zeus Compute nodes Operational
Zeus login node Operational
Data Mover nodes (CopyQ) Operational
Slurm Controller (Zeus) Operational
Lustre filesystems Partial Outage (97.68 % uptime over the past 90 days)
/scratch filesystem Operational (100.0 % uptime over the past 90 days)
/group filesystem Operational (100.0 % uptime over the past 90 days)
/astro filesystem Partial Outage (88.44 % uptime over the past 90 days)
/askapbuffer filesystem Operational (99.99 % uptime over the past 90 days)
/askapingest filesystem Operational (100.0 % uptime over the past 90 days)
ASKAP Operational
ASKAP ingest nodes Operational
ASKAP service nodes Operational
Garrawarla Partial Outage
Garrawarla compute nodes Partial Outage
Garrawarla login node Operational
Slurm Controller (Garrawarla) Operational
Nimbus Operational
Ceph storage Operational
Nimbus instances Operational
Nimbus dashboard Operational
Storage Systems Operational
Data Portal Systems Operational
Banksia Operational
MWA Nodes Operational
CASDA Nodes Operational
Long-term online storage Operational
Acacia Operational
Ingest Operational
Central Services Operational
Authentication and Authorization Operational
Service Desk Operational
License Server Operational
Application Portal Operational
Origin Operational
/home filesystem Operational
/pawsey filesystem Operational
Central Slurm Database Operational
Visualisation Services Operational
Remote Vis Operational
Nebula Operational
Visualisation Lab Operational
The Australian Biocommons Operational
Fgenesh++ Operational
[Metrics charts: Active Instances (Nimbus), Active Cores (Nimbus), Allocated Nodes (Magnus workq), Allocated Nodes (Galaxy workq)]
Past Incidents
May 28, 2022

No incidents reported today.

May 27, 2022

No incidents reported.

May 26, 2022

Unresolved incident: Garrawarla draining nodes due to OST usage.

May 25, 2022
Resolved - The extreme slowness issues with Banksia last week now appear to be resolved, after the vendor worked to clean up the system and implement a new software patch.
May 25, 11:25 AWST
Update - Banksia queues are now idle and the system is ready for use again. The vendor has remedied the short reads and dealt with one tape in particular that was causing a queue blockage. We will monitor for a few days and, if all is operating optimally, we will look to close this incident.
May 23, 09:53 AWST
Update - Banksia has been patched and files are once again staging from tape; the stage queue length has subsequently halved. The vendor is completing further cleanup and verification work on the filesystem cache, which may slow things down over the next day or so, but the vast majority of queued stage requests are now running. We will monitor this today and over the weekend, and expect the system to settle back to normal by Monday.
May 20, 09:59 AWST
Update - The vendor is working on a new software release (2.6.3) for Banksia to remedy the current issues that meant the tape queue scheduler was idled. This release is currently undergoing testing before it is scheduled to be rolled into production; more details on the release date will be provided once testing is completed in the next few days. Online files remain available.
May 19, 09:45 AWST
Update - Overnight the vendor has needed to idle the staging scheduler so presently files requested from tape won’t be retrieved and will be queued, however files already online are still available. The scheduler will be resumed after the next patch, due out shortly.
May 18, 08:49 AWST
Update - Banksia load has reduced further and the system state is looking better today. A volume of files still remains to be recalled from tape, but it is a manageable number.
May 17, 09:46 AWST
Update - Pawsey and the vendor are continuing to monitor for further issues; the system is still very busy post-cleanup and is expected to remain so for some days.
May 16, 09:04 AWST
Monitoring - The Vendor has resumed the jobs to stage files online, but there is quite a long tail of items to process and retrieve, mostly Pawsey Data Portal file requests. So the slow state may persist for the time being. Another update will be provided tomorrow.
May 14, 08:50 AWST
Update - The vendor has completed their initial work. The Staging Scheduler has been stopped so the system can process the backlog; this is unlikely to complete today.
May 13, 11:56 AWST
Update - Banksia has slowed down further.
We have met with the vendor and have a robust plan to address all known issues.

May 13, 11:41 AWST
Update - Banksia is under high load and operating extremely slowly, so although it is online, it is essentially unavailable.
The vendor is investigating and we will provide an update later today.

May 13, 09:53 AWST
Investigating - The Banksia system continues to experience intermittent periods of high load, nodes being ejected from the cluster and issues that affect the staging of some files from tape.

Work continues to further tune the system to ensure all migrated files stage successfully.

Please email us about any files you require if they do not stage back after a day.

May 11, 13:38 AWST
May 24, 2022

No incidents reported.

May 23, 2022
May 22, 2022

No incidents reported.

May 21, 2022

No incidents reported.

May 20, 2022
Resolved - Both Magnus and Galaxy were restored to service mid-afternoon.

It appears that the variable-speed drives that control some of the pumps suffered a fault. This has been escalated to the controls contractor to determine the root cause; however, the drives reset successfully, allowing us to restore services.

May 20, 17:36 AWST
Investigating - The Cray XC systems at Pawsey are currently offline due to a pressure drop on the 22-degree cooling loop.
May 20, 07:15 AWST
May 19, 2022
May 18, 2022
May 17, 2022
May 16, 2022

Unresolved incident: /askapbuffer 'At Risk' of outage.

May 15, 2022

No incidents reported.

May 14, 2022