Pawsey Scheduled Maintenance (March)

Scheduled Maintenance Report for Pawsey Supercomputing Research Centre

Completed

ASKAP Buffer has had disk firmware updates applied.

ASKAP Ingest Cluster is still being worked on. One node has been stress tested and blessed as ready for service. A second node is still being worked on.

Setonix has been updated to the latest available "Extended Support" version of the Cray Operating System which provides bug fixes and security patches.

Additional versions of the Cray Programming Environment are now available on Setonix (cpe/24.03, cpe/24.07, cpe/24.11). The AMD GPU driver has also been updated from the previous version, which supported ROCm 5.5 through 6.1, to a newer version that supports ROCm 5.7 through 6.3. This ensures researchers can continue to make use of the latest GPU features, and it has allowed us to provide an updated software stack, built against ROCm 6.2.4, with new software versions. To minimise disruption, the existing software stack built against ROCm 5.7.3 will remain the default.

To try out the new software stack, we have prepared a documentation page with instructions:
https://pawsey.atlassian.net/wiki/spaces/US/pages/695074817/March+2025+Software+Update+-+Important+Information

Please read this page before using the new software, as we are still finalising some of the packages with details listed.

The post-maintenance test suite detected an issue with our ARM Forge license for DDT, which we will follow up with the vendor.

Thank you to the morning crew for shutting down Setonix early this morning and our HPE onsite engineers for starting so early. And thank you to the services team for their hard work in setting up the new software environment and performing testing (Ilkhom Abdurakhmanov, Deva Kumar Deeptimahanti, Craig Meyer and Chris Harris).

As always, be kind, and e-mail our friendly help desk staff (help@pawsey.org.au) if you encounter any issues.
Posted Mar 04, 2025 - 16:57 AWST

Verifying

Banksia has been returned to service.

Visualisation services have been patched.

ASKAP Buffer has had disk firmware updates applied.


Setonix has been updated to the latest available "Extended Support" version of the Cray Operating System which provides bug fixes and security patches.

ASKAP Ingest Cluster is currently being stress tested.
Posted Mar 04, 2025 - 15:24 AWST

Update

Banksia has been returned to service.
Posted Mar 04, 2025 - 11:25 AWST

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Mar 04, 2025 - 06:00 AWST

Update

Maintenance for Setonix has been brought forward to 6 AM at the request of HPE.
Posted Feb 27, 2025 - 10:30 AWST

Scheduled

Maintenance will be carried out on Pawsey systems on Tuesday 4 March to apply required patches and updates that improve system stability, security, and performance. This maintenance window will also be used to undertake other tasks that require downtime.

Planned work for this window includes:
• HPE will be modifying the Lustre Network on the /scratch and /software filesystems to allow Pawsey to test multiple Lustre Network Drivers. This change has been tested on the Test and Development system and is supported by HPE.
• Setonix will be updated to the latest available "Extended Support" version of the Cray Operating System which provides bug fixes and security patches.
• Banksia will have a ScoutAM upgrade.
• ASKAP Buffer will have disk firmware updates applied.
• ASKAP Ingest Cluster will have remedial hardware replacement.
• Upgrade of DDN 7990 arrays connected to Banksia and replacement of batteries.
• Patching of visualisation services will be undertaken.
• Patching of core Pawsey services will be undertaken.

When Setonix is returned to service, additional versions of the Cray Programming Environment will be available (24.03, 24.07, 24.11). The AMD GPU driver will also be updated from the current version, which supports ROCm 5.5, 5.6, 5.7, 6.0, and 6.1, to a newer version that supports ROCm 5.7, 6.0, 6.1, 6.2, and 6.3. This ensures researchers can continue to make use of the latest GPU features, and it allows us to provide an updated software stack with new software versions, built against ROCm 6.2.4. To minimise disruption, the existing software stack built against ROCm 5.7.3 will remain the default.
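
To illustrate what the driver change means in practice, the following is a minimal Python sketch, not an official Pawsey tool. It checks a ROCm version against the two support lists quoted above. It assumes that any patch release (for example 6.2.4) is covered whenever its major.minor series is listed. The helper names `series` and `supported` and the example version 5.5.0 are hypothetical and used only for illustration.

```python
# Sketch only: the ROCm series supported by each driver, as listed in this notice.
OLD_DRIVER = {"5.5", "5.6", "5.7", "6.0", "6.1"}
NEW_DRIVER = {"5.7", "6.0", "6.1", "6.2", "6.3"}


def series(version: str) -> str:
    """Reduce a full version such as '6.2.4' to its 'major.minor' series."""
    major, minor = version.split(".")[:2]
    return f"{major}.{minor}"


def supported(version: str, driver: set) -> bool:
    """Assumption: a patch release is supported if its series is listed."""
    return series(version) in driver


if __name__ == "__main__":
    # 5.7.3 and 6.2.4 are the stacks named in this notice; 5.5.0 is hypothetical.
    for version in ("5.5.0", "5.7.3", "6.2.4"):
        print(
            version,
            "old driver:", supported(version, OLD_DRIVER),
            "new driver:", supported(version, NEW_DRIVER),
        )
```

Under these assumptions, the default stack (ROCm 5.7.3) falls within both lists, the new stack (ROCm 6.2.4) falls only within the new driver's list, and anything built against the 5.5 or 5.6 series falls outside the new driver's list.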

We expect to be able to bring all services back by the end of the day (in the case of Setonix, sometime in the evening). If you have any questions, please contact help@pawsey.org.au.
Posted Feb 25, 2025 - 12:00 AWST
This scheduled maintenance affected: ASKAP (ASKAP ingest nodes, ASKAP service nodes), Central Services (Authentication and Authorization, Service Desk, License Server, Application Portal, Origin, /home filesystem, /pawsey filesystem, Central Slurm Database, Documentation), Lustre filesystems (/scratch filesystem, /software filesystem, /askapbuffer filesystem, /askapingest filesystem), Setonix (Login nodes, Data-mover nodes, Slurm scheduler, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition, Setonix gpu partition, Setonix gpu high mem partition, Setonix gpu debug partition), Visualisation Services (Remote Vis, Vis scheduler, Setonix vis nodes, Nebula vis nodes, Visualisation Lab, Reservation, CARTA - Stable, CARTA - Test, Pawsey Remote VR), and Storage Systems (Banksia, Data Portal Systems, CASDA Nodes, MWA Nodes, MWA ASVO).