Pawsey Scheduled Maintenance (May)
Scheduled Maintenance Report for Pawsey Supercomputing Research Centre
Completed
Maintenance is complete. All systems have been returned to service.
Posted May 15, 2024 - 11:13 AWST
Update
Scheduled maintenance is still in progress. We will provide updates as necessary.
Posted May 15, 2024 - 10:33 AWST
Update
• Banksia was returned to service last night
• ASKAP Ingest nodes are ready for testing
• Setonix is waiting for the green light to be returned to service
• The Garrawarla nodes are being booted in preparation for testing
Posted May 15, 2024 - 08:55 AWST
Update
HPE have completed one hardware replacement in /scratch. They are currently replacing two drives. Pawsey has begun testing the system.
The InfiniBand refactor is progressing well and testing has commenced.
Nimbus has been updated to "Victoria".
SLURM controllers have had their underlying operating system upgraded.
Patching of core services is complete.

At this stage we think it is unlikely that Setonix or Garrawarla will be returned to service before tomorrow morning, as significant testing still needs to be performed.
Posted May 14, 2024 - 16:16 AWST
Update
Scheduled maintenance is still in progress. We will provide updates as necessary.
Posted May 14, 2024 - 15:51 AWST
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted May 14, 2024 - 07:00 AWST
Scheduled
Maintenance will be carried out on Pawsey systems on Tuesday the 14th of May to apply required patches and updates that improve the systems' stability, security, and performance. This maintenance window will also be used for other tasks that require downtime.

Planned work for this window includes:
• Banksia will be upgraded to ScoutAM version 2.22
• SLURM controllers for ASKAP Ingest and Garrawarla will have their operating system upgraded
• SLURM DBD for Setonix, ASKAP Ingest and Garrawarla will have its operating system upgraded
• InfiniBand HCA Firmware on Garrawarla will be updated
• InfiniBand fabric will be refactored to deprecate FDR hardware
• NetApp which provides /home on Setonix will have its firmware updated
• HPE will be replacing an IO module on one of the disk enclosures in /scratch
• An upgrade will be performed on the Nimbus cluster
• Patching of core Pawsey services

The maintenance window is two days long to allow HPE to resolve any issues that may result from their hardware replacement, but we expect Setonix and Garrawarla to be returned to service on the 14th.
Posted May 03, 2024 - 12:36 AWST
This scheduled maintenance affected:
• ASKAP: ASKAP ingest nodes, ASKAP service nodes
• Central Services: Authentication and Authorization, Service Desk, License Server, Application Portal, Origin, /home filesystem, /pawsey filesystem, Central Slurm Database, Nebula, Documentation
• Nimbus: Ceph storage, Nimbus instances, Nimbus dashboard, Nimbus APIs
• Garrawarla: Garrawarla workq partition, Garrawarla gpuq partition, Garrawarla asvoq partition, Garrawarla copyq partition, Garrawarla login node, Slurm Controller (Garrawarla)
• The Australian Biocommons: Fgenesh++
• Storage Systems: Acacia - Projects, Banksia, Data Portal Systems, MWA Nodes, CASDA Nodes, Acacia - Ingest, MWA ASVO
• Lustre filesystems: /scratch filesystem (new), /software filesystem, /askapbuffer filesystem, /askapingest filesystem
• Setonix: Login nodes, Data-mover nodes, Slurm scheduler, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition, Setonix gpu partition, Setonix gpu high mem partition, Setonix gpu debug partition
• Visualisation Services: Remote Vis, Vis scheduler, Setonix vis nodes, Nebula vis nodes, Visualisation Lab, Reservation, CARTA - Stable, CARTA - Test