Get webhook notifications whenever Pawsey Supercomputing Research Centre creates an incident, updates an incident, resolves an incident or changes a component status.
Identified - The internal login node on galaxy (galaxy-int) that is used for interactive slurm jobs has been logging PCIe bus errors recently. HPE/Cray staff are aware and will investigate further. Although compute blades can be removed for maintenance in a warm-swap operation, the internal login node (c0-0c0s0n2) sits on the same blade as the internal boot node (c0-0c0s0n1) for the system, meaning repair may be a more disruptive process.
Nov 07, 2022 - 06:28 AWST
Setonix
Major Outage
Login nodes
?
Under Maintenance
Data-mover nodes
?
Operational
Slurm scheduler
?
Operational
Setonix work partition
Operational
Setonix debug partition
Degraded Performance
Setonix long partition
Major Outage
Setonix copy partition
Degraded Performance
Setonix askaprt partition
Operational
Setonix highmem partition
Operational
Lustre filesystems
Operational
90 days ago
100.0
% uptime
Today
/scratch filesystem (new)
?
Operational
90 days ago
100.0
% uptime
Today
/software filesystem
?
Operational
90 days ago
100.0
% uptime
Today
/scratch filesystem (legacy)
?
Operational
90 days ago
100.0
% uptime
Today
/group filesystem
?
Operational
90 days ago
100.0
% uptime
Today
/astro filesystem
?
Operational
90 days ago
100.0
% uptime
Today
/askapbuffer filesystem
?
Operational
90 days ago
100.0
% uptime
Today
/askapingest filesystem
?
Operational
90 days ago
100.0
% uptime
Today
Storage Systems
Operational
90 days ago
100.0
% uptime
Today
Acacia
?
Operational
Banksia
?
Operational
Data Portal Systems
?
Operational
MWA Nodes
Operational
CASDA Nodes
Operational
Ingest
Operational
MWA ASVO
?
Operational
90 days ago
100.0
% uptime
Today
Topaz
Operational
GPU partition
?
Operational
Topaz login nodes
Operational
Slurm Controller (topaz)
?
Operational
ASKAP
Operational
ASKAP ingest nodes
?
Operational
ASKAP service nodes
Operational
Garrawarla
Operational
Garrawarla workq partition
?
Operational
Garrawarla gpuq partition
?
Operational
Garrawarla asvoq partition
?
Operational
Garrawarla copyq partition
?
Operational
Garrawarla login node
Operational
Slurm Controller (Garrawarla)
Operational
Nimbus
Operational
Ceph storage
?
Operational
Nimbus instances
?
Operational
Nimbus dashboard
?
Operational
Central Services
Operational
90 days ago
100.0
% uptime
Today
Authentication and Authorization
?
Operational
Service Desk
Operational
License Server
Operational
Application Portal
?
Operational
Origin
?
Operational
/home filesystem
Operational
/pawsey filesystem
Operational
Central Slurm Database
?
Operational
90 days ago
100.0
% uptime
Today
Visualisation Services
Operational
Remote Vis
?
Operational
Nebula
?
Operational
Visualisation Lab
Operational
The Australian Biocommons
Operational
Fgenesh++
?
Operational
Legacy Systems
Operational
Galaxy Compute nodes
?
Operational
Galaxy login nodes
?
Operational
Slurm Controller (Galaxy)
Operational
Operational
Degraded Performance
Partial Outage
Major Outage
Maintenance
Major outage
Partial outage
No downtime recorded on this day.
No data exists for this day.
had a major outage.
had a partial outage.
Related
No incidents or maintenance related to this downtime.
Completed -
The work to connect Phase 2 of the Setonix system and implement the upgraded Cassini cards has been completed and Phase 1 has now been returned to service.
While we understand researchers frustration at the delay, the work was required to ensure Setonix was restored as a stable system.
Pawsey staff and our vendor partners continue to work on Phase 2 to bring that into service within the February timeframe.
Please note the February maintenance window for Setonix will be extended to 2 days (Tuesday 7th and Wednesday 8th of February) to allow sufficient time to make configuration changes to Phase 1 to allow Phase 2 to be brought on-line.
We appreciate your support and ask if you have any questions please e-mail help@pawsey.org.au.
Jan 25, 15:09 AWST
Update -
Our vendor is working on bringing Setonix back into service in it's final configuration. The remaining services at Pawsey should all be operational. If you are having issues, please create a ticket via https://pawsey.org.au/support/
While Setonix was offline, Pawsey took the opportunity to request HPE to replace all the network cards in Phase 1 with the upgraded Cassini cards, these are the same network cards which are in Phase 2. This is a significant improvement to the networking capacity and capability with higher bandwidth (200 Gb/s) and come with an updated libfabric which resolves many of the MPI issues encountered with Phase 1. HPE are rectifying issues with the high-speed fabric, which is taking longer than expected, this work is required to ensure Phase 1 is restored in a resilient fashion.
While we understand researchers frustration at the delay, the work currently being done will ensure that Setonix is restored as a stable system as well as allowing us to bring Phase 2 into production as soon as possible.
Jan 19, 12:06 AWST
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 13, 08:00 AWST
Scheduled -
Over the weekend of 14/15 January, the high voltage supply switchgear for Pawsey will undergo its annual inspection and maintenance. This requires that the supplies are isolated and so ALL Pawsey systems will be shut down for the weekend. Systems will be shut down from Friday at 08:00 and startup resumed on Monday morning. We are expecting systems to be restored by the evening of Tuesday the 17th.
Nov 29, 08:25 AWST