Pawsey Supercomputing Research Centre
Identified - HPE are rebooting nodes in cabinets x1005, x1006 and x1007
Oct 15, 2024 - 15:04 AWST
Setonix Operational
Login nodes Operational
Data-mover nodes Operational
Slurm scheduler Operational
Setonix work partition Operational
Setonix debug partition Operational
Setonix long partition Operational
Setonix copy partition Operational
Setonix askaprt partition Operational
Setonix highmem partition Operational
Setonix gpu partition Operational
Setonix gpu high mem partition Operational
Setonix gpu debug partition Operational
Lustre filesystems Operational
/scratch filesystem (new) Operational
/software filesystem Operational
/askapbuffer filesystem Operational
/askapingest filesystem Operational
Storage Systems Operational
Acacia - Projects Operational
Banksia Operational
Data Portal Systems Operational
MWA Nodes Operational
CASDA Nodes Operational
Acacia - Ingest Operational
MWA ASVO Operational
ASKAP Operational
ASKAP ingest nodes Operational
ASKAP service nodes Operational
Garrawarla Operational
Garrawarla workq partition Operational
Garrawarla gpuq partition Operational
Garrawarla asvoq partition Operational
Garrawarla copyq partition Operational
Garrawarla login node Operational
Slurm Controller (Garrawarla) Operational
Nimbus Operational
Ceph storage Operational
Nimbus instances Operational
Nimbus dashboard Operational
Nimbus APIs Operational
Central Services Operational
Authentication and Authorization Operational
Service Desk Operational
License Server Operational
Application Portal Operational
Origin Operational
/home filesystem Operational
/pawsey filesystem Operational
Central Slurm Database Operational
Documentation Operational
Visualisation Services Operational
Remote Vis Operational
Vis scheduler Operational
Setonix vis nodes Operational
Nebula vis nodes Operational
Visualisation Lab Operational
Reservation Operational
CARTA - Stable Operational
CARTA - Test Operational
Pawsey Remote VR Operational
The Australian Biocommons Operational
Fgenesh++ Operational
Past Incidents
Oct 16, 2024

No incidents reported today.

Oct 15, 2024
Resolved - This incident has been resolved.
Oct 15, 15:10 AWST
Monitoring - A fix has been implemented and we are monitoring the results.
Oct 14, 13:45 AWST
Investigating - Connectivity from the MRO network is non-functional.
Oct 11, 16:08 AWST
Oct 14, 2024
Oct 13, 2024

No incidents reported.

Oct 12, 2024

No incidents reported.

Oct 11, 2024
Oct 10, 2024
Resolved - This incident has been resolved.
Oct 10, 10:25 AWST
Identified - The issue has been identified and a fix is being implemented.
Oct 10, 10:12 AWST
Investigating - We are currently investigating this issue.
Oct 10, 10:12 AWST
Oct 9, 2024
Completed - Garrawarla and Setonix have both passed ReFrame testing. Reservations have been removed and both systems have been returned to service.

Please note that /scratch is very full. Please be courteous to other researchers and remove any unused data.

Oct 9, 17:00 AWST
Verifying - HPE have handed Setonix back to Pawsey.

Garrawarla and Setonix login nodes are being powered back on and verified.

ReFrame will be run on both systems before they are returned to service.

Oct 9, 15:07 AWST
Update - Maintenance on Banksia is complete and Banksia has been returned to service.

Core services have been patched.

Tomorrow, the Setonix login, data mover and visualisation nodes as well as the Garrawarla login nodes will be shut down to allow the IOR test to be conducted. Once the IOR tests are complete, the Setonix login, data mover and visualisation nodes as well as the Garrawarla login and GPU nodes will be returned to service.

Oct 8, 16:02 AWST
Update - Scheduled maintenance is still in progress. We will provide updates as necessary.
Oct 8, 15:19 AWST
Update - Scheduled maintenance is still in progress. We will provide updates as necessary.
Oct 8, 08:27 AWST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Oct 8, 05:00 AWST
Update - Starting at 8 AM (AWST) on the 9th of October 2024, the Setonix login, data mover and visualisation nodes as well as the Garrawarla login nodes will be shut down to allow the IOR test to be conducted. Once the IOR tests are complete, the Setonix login, data mover and visualisation nodes as well as the Garrawarla login and GPU nodes will be returned to service.

This was originally scheduled for the 8th of October 2024 but has been pushed back one day to support the CARTA workshop. Garrawarla will be operational through the 8th of October 2024.

Oct 7, 10:04 AWST
Update - We will be undergoing scheduled maintenance during this time.
Oct 7, 10:03 AWST
Scheduled - To support the Radio Astronomy School, our regular first-Tuesday-of-the-month maintenance will instead occur on the second Tuesday of the month in October.

On the 8th of October 2024 at 5 AM (AWST), HPE will commence a number of benchmarks on Setonix, including HPL on the CPU and GPU partitions, IOR on the scratch filesystem, and the OSU micro-benchmarks. Starting at 8 AM (AWST), the Setonix login, data mover and visualisation nodes as well as the Garrawarla login nodes will be shut down to allow the IOR test to be conducted. Once the IOR tests are complete, the Setonix login, data mover and visualisation nodes as well as the Garrawarla login and GPU nodes will be returned to service.

Due to the nature of the HPL benchmark, HPE require exclusive access to the CPU and GPU partitions on Setonix for up to two days. As such, Pawsey will have to wait until Thursday, 10th October at 5 AM (AWST) for the Setonix compute nodes to be handed back. Pawsey will attempt to return Setonix to full service as quickly as possible after this.

Maintenance will be carried out on all Pawsey systems on Tuesday the 8th of October to apply required patches and updates that improve the systems' stability, security, and performance. This maintenance window will also be used to undertake other tasks that require downtime.

Planned work for this window includes:
• Upgrade of the ScoutAM software and of the firmware on Banksia's DDN hardware
• Patching of core Pawsey services

Further updates will be provided on status.pawsey.org.au, and any questions should be directed to help@pawsey.org.au.

Sep 27, 09:31 AWST
Oct 8, 2024
Oct 7, 2024
Resolved - Both Banksia tape libraries are online and staging and archiving files as normal.
Oct 7, 12:54 AWST
Investigating - The Banksia service is in a degraded/"at risk" state due to it operating on only one tape library, rather than the default two. This means that the primary copy of files will be unavailable for staging or archiving until Library1 is returned to service.
Oct 7, 09:36 AWST
Oct 6, 2024

No incidents reported.

Oct 5, 2024

No incidents reported.

Oct 4, 2024

No incidents reported.

Oct 3, 2024

No incidents reported.

Oct 2, 2024

No incidents reported.