Pawsey Supercomputing Research Centre
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 29, 2023 - 09:00 AWST
Scheduled - After many years of service, Topaz will be shut down and decommissioned.
Setonix Degraded Performance
97.78 % uptime over the past 90 days
Login nodes
Operational
Data-mover nodes
Operational
Slurm scheduler
Operational
Setonix work partition Operational
Setonix debug partition Operational
Setonix long partition Degraded Performance
Setonix copy partition Operational
Setonix askaprt partition Operational
Setonix highmem partition Operational
Setonix gpu partition Operational
97.78 % uptime over the past 90 days
Setonix gpu high mem partition Operational
97.78 % uptime over the past 90 days
Setonix gpu debug partition Operational
97.78 % uptime over the past 90 days
Lustre filesystems Operational
99.83 % uptime over the past 90 days
/scratch filesystem (new)
Operational
98.98 % uptime over the past 90 days
/software filesystem
Operational
100.0 % uptime over the past 90 days
/group filesystem
Operational
100.0 % uptime over the past 90 days
/astro filesystem
Operational
100.0 % uptime over the past 90 days
/askapbuffer filesystem
Operational
100.0 % uptime over the past 90 days
/askapingest filesystem
Operational
100.0 % uptime over the past 90 days
Storage Systems Operational
100.0 % uptime over the past 90 days
Acacia - Projects
Operational
Banksia
Operational
Data Portal Systems
Operational
MWA Nodes Operational
CASDA Nodes Operational
Acacia - Ingest
Operational
MWA ASVO
Operational
100.0 % uptime over the past 90 days
ASKAP Operational
ASKAP ingest nodes
Operational
ASKAP service nodes Operational
Garrawarla Operational
Garrawarla workq partition
Operational
Garrawarla gpuq partition
Operational
Garrawarla asvoq partition
Operational
Garrawarla copyq partition
Operational
Garrawarla login node Operational
Slurm Controller (Garrawarla) Operational
Nimbus Operational
Ceph storage
Operational
Nimbus instances
Operational
Nimbus dashboard
Operational
Nimbus APIs
Operational
Central Services Operational
100.0 % uptime over the past 90 days
Authentication and Authorization
Operational
Service Desk Operational
License Server Operational
Application Portal
Operational
Origin
Operational
/home filesystem Operational
/pawsey filesystem Operational
Central Slurm Database
Operational
100.0 % uptime over the past 90 days
Visualisation Services Operational
Remote Vis
Operational
Nebula
Operational
Visualisation Lab Operational
The Australian Biocommons Operational
Fgenesh++
Operational
Legacy Systems Under Maintenance
GPU partition
Operational
Topaz login nodes Under Maintenance
Slurm Controller (topaz)
Under Maintenance
Scheduled Maintenance
Pawsey Scheduled Maintenance (October) Oct 3, 2023 05:00-22:00 AWST
Maintenance will be carried out on Pawsey systems on Tuesday, 3 October, to apply required patches and updates that improve system stability, security, and performance. This maintenance window will also be used for other tasks that require downtime.

Planned work for this window includes:
* Patching core Pawsey services (including LDAP, Jira and Confluence)
* Upgrading the Tape Library firmware
* Replacing DDN hardware
* Running HPL on Setonix as part of acceptance testing

Posted on Sep 26, 2023 - 15:36 AWST
Past Incidents
Oct 1, 2023

No incidents reported today.

Sep 30, 2023

No incidents reported.

Sep 29, 2023
Completed - Coolant has been flushed. Nodes returned to service.
Sep 29, 09:54 AWST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 25, 07:00 AWST
Scheduled - HPE will be flushing and replacing the coolant in the racks delivered as part of Phase 1 (nid001000 to nid001511). The highmem partition will be unavailable and the work partition will be operating at reduced capacity.
Sep 18, 14:37 AWST
Sep 28, 2023

No incidents reported.

Sep 27, 2023
Resolved - We have run ReFrame tests across the two cabinets and returned to service the nodes that passed 100% of our tests.
Sep 27, 15:23 AWST
Investigating - CPU nodes in cabinets x1002 and x1003 have powered off. HPE are investigating. Any jobs running on nodes nid00[1512-2023] will have failed.
Sep 27, 10:27 AWST
Resolved - This incident has been resolved with our network provider.
Sep 27, 13:06 AWST
Update - We are continuing to investigate this issue.
Sep 27, 09:55 AWST
Investigating - We have noticed that emails sent to pawsey.org.au addresses are not being received by our mail server, and we are currently investigating the issue. This is affecting our Help Desk (help@pawsey.org.au); however, lodging tickets directly through the web interface at https://support.pawsey.org.au/portal/servicedesk/customer/portal/5/user/login (using your Pawsey credentials) should still work.
Sep 27, 09:44 AWST
Sep 26, 2023
Resolved - We have seen no further issues with the metadata servers. We are still waiting for a root cause analysis from HPE.
Sep 26, 15:26 AWST
Monitoring - Both metadata targets have been remounted by HPE engineers, who are monitoring the system. The root cause of the issue is still under investigation.
Sep 25, 14:25 AWST
Update - Today is a public holiday in Western Australia; however, we are still monitoring this incident and awaiting an update from our vendor. We are aware that over 600 jobs in the Slurm queue are stuck in the 'Completing' state, presumably because they were unable to finalise their file I/O before exiting.
Sep 25, 05:10 AWST
Update - Both metadata servers have been STONITHed. A critical case has been lodged with HPE.
Sep 22, 17:34 AWST
Identified - Both metadata servers have booted. The metadata targets have been mounted and are currently in recovery mode.
Sep 22, 16:17 AWST
Investigating - We are currently investigating this issue.
Sep 22, 13:21 AWST
Sep 25, 2023
Sep 24, 2023

No incidents reported.

Sep 23, 2023

No incidents reported.

Sep 22, 2023
Sep 21, 2023

No incidents reported.

Sep 20, 2023

No incidents reported.

Sep 19, 2023
Resolved - This incident has been resolved.
Sep 19, 17:42 AWST
Monitoring - All RADOS gateway nodes became overloaded, so we are restarting them. The service appears stable again, but we will continue to monitor it.
Sep 19, 08:44 AWST
Investigating - Acacia Ingest is experiencing 504 gateway timeouts. We are investigating.
Sep 19, 08:26 AWST
Sep 18, 2023

No incidents reported.

Sep 17, 2023

No incidents reported.