Get webhook notifications whenever Pawsey Supercomputing Research Centre creates an incident, updates an incident, resolves an incident or changes a component status.
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 29, 2023 - 09:00 AWST
Scheduled - After many years of service, Topaz will be shut down and decommissioned.
Setonix
Degraded Performance
97.78% uptime over the past 90 days
Login nodes
Setonix user-facing login nodes
Operational
Data-mover nodes
Copy nodes for Setonix
Operational
Slurm scheduler
Setonix batch system
Operational
Setonix work partition
Operational
Setonix debug partition
Operational
Setonix long partition
Degraded Performance
Setonix copy partition
Operational
Setonix askaprt partition
Operational
Setonix highmem partition
Operational
Setonix gpu partition
Operational
97.78% uptime over the past 90 days
Setonix gpu high mem partition
Operational
97.78% uptime over the past 90 days
Setonix gpu debug partition
Operational
97.78% uptime over the past 90 days
Lustre filesystems
Operational
99.83% uptime over the past 90 days
/scratch filesystem (new)
The main filesystem used by HPC jobs on Setonix
Operational
98.98% uptime over the past 90 days
/software filesystem
Project software Lustre storage for Setonix
Operational
100.0% uptime over the past 90 days
/group filesystem
Medium-term storage allocated to HPC projects
Operational
100.0% uptime over the past 90 days
/astro filesystem
Dedicated astronomy Lustre filesystem for MWA
Operational
100.0% uptime over the past 90 days
/askapbuffer filesystem
Dedicated ASKAP filesystem (askapfs1)
Operational
100.0% uptime over the past 90 days
/askapingest filesystem
E1000 NVMe (askapfs2)
Operational
100.0% uptime over the past 90 days
Storage Systems
Operational
100.0% uptime over the past 90 days
Acacia - Projects
Acacia object storage system, for warm storage
This partition is designed for storage of research project data
Operational
Banksia
Cool storage.
Operational
Data Portal Systems
Pawsey Data Portal.
Operational
MWA Nodes
Operational
CASDA Nodes
Operational
Acacia - Ingest
Acacia object storage system, for warm storage
This partition is designed for ingest of radioastronomy data
Operational
MWA ASVO
Virtual Observatory
This service relies on the Acacia Ingest cluster being operational
Operational
100.0% uptime over the past 90 days
ASKAP
Operational
ASKAP ingest nodes
Dedicated data ingest nodes that accept data from the Murchison Radio-astronomy Observatory (MRO)
https://www.csiro.au/en/Research/Astronomy/ASKAP-and-the-Square-Kilometre-Array/MRO
Operational
ASKAP service nodes
Operational
Garrawarla
Operational
Garrawarla workq partition
Dedicated cluster for MWA - Default CPU Slurm partition (72 nodes with 38 Xeon cores per node)
Operational
Garrawarla gpuq partition
Dedicated MWA V100 GPU partition (72 nodes)
Operational
Garrawarla asvoq partition
Garrawarla 6-node CPU partition dedicated to All-Sky Virtual Observatory tasks
Operational
Garrawarla copyq partition
2-node data transfer partition primarily for ASVO staging tasks
Operational
Garrawarla login node
Operational
Slurm Controller (Garrawarla)
Operational
Nimbus
Operational
Ceph storage
Root and storage volumes for Nimbus instances
Operational
Nimbus instances
Whether Nimbus instances are running and accessible
Operational
Nimbus dashboard
Whether the Horizon dashboard is accessible to create and change Nimbus resources
Operational
Nimbus APIs
Nimbus APIs, used by the Nimbus dashboard, openstack CLI, and other external tools like Terraform.
Operational
Central Services
Operational
100.0% uptime over the past 90 days
Authentication and Authorization
LDAP and Single Sign On for all Pawsey Users
Operational
Service Desk
Operational
License Server
Operational
Application Portal
Web service for potential users requesting access
Operational
Origin
Self-service portal for users and principal investigators showing usage / access.
Operational
/home filesystem
Operational
/pawsey filesystem
Operational
Central Slurm Database
The centralised accounting used by all clusters
Operational
100.0% uptime over the past 90 days
Visualisation Services
Operational
Remote Vis
Topaz Remote Visualisation
Operational
Nebula
Windows Cluster (https://nebula.pawsey.org.au)
Operational
Visualisation Lab
Operational
The Australian Biocommons
Operational
Fgenesh++
Fgenesh++ is a bioinformatics pipeline for automatic prediction of genes in eukaryotic genomes.
Operational
Legacy Systems
Under Maintenance
GPU partition
Production Slurm 'gpuq' partition on Topaz consisting of 20 nodes
Operational
Topaz login nodes
Under Maintenance
Slurm Controller (topaz)
slurmctld for the Topaz cluster
Under Maintenance
Maintenance will be carried out on Pawsey systems on Tuesday the 3rd of October to apply required patches and updates that improve system stability, security, and performance. This maintenance window will also be used for other tasks that require downtime.
Planned work for this window includes:
* Patching of core Pawsey services (including LDAP, Jira and Confluence)
* Upgrade of Tape Library firmware
* Replacement of DDN hardware
* Running HPL on Setonix as part of acceptance testing
Posted on Sep 26, 2023 - 15:36 AWST
Completed -
Coolant has been flushed. Nodes returned to service.
Sep 29, 09:54 AWST
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 25, 07:00 AWST
Scheduled -
HPE will be flushing and replacing the coolant in the racks delivered as part of Phase 1 (nid001000 to nid001511). The highmem partition will be unavailable and the work partition will be operating at reduced capacity.
Sep 18, 14:37 AWST
Resolved -
We have run ReFrame tests over the two cabinets and returned to service the nodes that passed 100% of our tests.
Sep 27, 15:23 AWST
Investigating -
CPU nodes in cabinets x1002 and x1003 have powered off. HPE are investigating. Any jobs running on nodes nid00[1512-2023] will have failed.
Sep 27, 10:27 AWST
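For anyone checking whether a job touched the affected hardware, the range nid00[1512-2023] quoted in the notice above can be expanded into individual node names. The Python sketch below is illustrative only: the function name expand_node_range and the example list of job nodes are hypothetical, and it handles just a single prefix[low-high] range, not the full hostlist syntax that Slurm accepts.

```python
import re


def expand_node_range(expr: str) -> list[str]:
    """Expand one 'prefix[low-high]' expression into a list of node names.

    Hypothetical helper for illustration; it does not cover comma-separated
    or nested ranges.
    """
    match = re.fullmatch(r"([A-Za-z0-9_-]*)\[(\d+)-(\d+)\]", expr)
    if match is None:
        raise ValueError(f"unsupported node expression: {expr!r}")
    prefix, low, high = match.groups()
    width = len(low)  # preserve any zero padding used inside the brackets
    return [f"{prefix}{n:0{width}d}" for n in range(int(low), int(high) + 1)]


# Node range taken from the incident notice.
affected = set(expand_node_range("nid00[1512-2023]"))

# Hypothetical node list for one of your own jobs.
my_job_nodes = ["nid001400", "nid001600"]
hit = [node for node in my_job_nodes if node in affected]
print(hit)
```

Any node name left in hit means the job ran on hardware that lost power during this incident.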
Resolved -
This incident has been resolved with our network provider.
Sep 27, 13:06 AWST
Update -
We are continuing to investigate this issue.
Sep 27, 09:55 AWST
Investigating -
We have noticed that emails sent to pawsey.org.au addresses are not being received by our mail server, and we are currently investigating the issue. This is affecting our Help Desk (help@pawsey.org.au); however, lodging tickets directly through the web interface at https://support.pawsey.org.au/portal/servicedesk/customer/portal/5/user/login (using your Pawsey credentials) should still work.
Sep 27, 09:44 AWST
Resolved -
We have seen no further issues with the metadata servers. We are still waiting for a root cause analysis from HPE.
Sep 26, 15:26 AWST
Monitoring -
Both metadata targets have been remounted by HPE engineers, who are monitoring the system. The root cause of the issue is currently under investigation.
Sep 25, 14:25 AWST
Update -
Today is a public holiday in Western Australia; however, we are still monitoring this incident and awaiting an update from our vendor. We are aware that there are over 600 jobs in the Slurm queue stuck in the 'Completing' state, presumably because they were unable to finalise file I/O before exiting.
Sep 25, 05:10 AWST
Update -
Both metadata servers have been STONITHed. A critical case has been lodged with HPE.
Sep 22, 17:34 AWST
Identified -
Both metadata servers have booted. The metadata targets have been mounted and are currently in recovery mode.
Sep 22, 16:17 AWST
Investigating -
We are currently investigating this issue.
Sep 22, 13:21 AWST
Resolved -
This incident has been resolved.
Sep 19, 17:42 AWST
Monitoring -
All RADOS gateway nodes became overloaded and we are restarting them. The service appears stable again, but we will continue to monitor it.
Sep 19, 08:44 AWST
Investigating -
Acacia Ingest is experiencing 504 gateway timeouts. We are investigating.
Sep 19, 08:26 AWST