Pawsey Scheduled Maintenance (December)
Scheduled Maintenance Report for Pawsey Supercomputing Research Centre
Completed
Banksia has been returned to service, but with a single library so that SpectraLogic can diagnose hardware issues with the second library.

Garrawarla has been returned to service. Please remember that Garrawarla will be shut down on the 2nd January 2025.

Setonix has been returned to service. Please note:
• the gpu-dev partition now contains 10 nodes with a limit of 1 running job and 4 queued jobs per user (and 2 nodes per job).
• the slurm module is no longer loaded by default, as all the environment variables it set are already in the pawsey module.
• the slurm controller configuration has been tweaked to weight the size of the job (Pawsey will monitor this change and determine the final weightings in the new year).

Just after the maintenance reservation was removed on Setonix, one of its cooling units detected a pressure issue and performed an emergency shutdown of nodes in three cabinets. HPE has restored the cooling unit to full functionality; however, some jobs running on the affected nodes would have been killed. HPE is investigating the issue to determine the root cause.

There will be no maintenance in January. Merry Chrimbo one and all.
Posted Dec 03, 2024 - 17:47 AWST
Update
One of the cooling units for Setonix has detected an issue, which has resulted in a number of nodes being shut down. HPE is investigating the issue.
Posted Dec 03, 2024 - 15:40 AWST
Verifying
Setonix has been returned to service. Please note:
• the gpu-dev partition now contains 10 nodes with a limit of 1 running job and 4 queued jobs per user (and 2 nodes per job).
• the slurm module is no longer loaded by default, as all the environment variables it set are already in the pawsey module.
• the slurm controller configuration has been tweaked to weight the size of the job (Pawsey will monitor this change and determine the final weightings in the new year).

Garrawarla has been returned to service. Please remember it will be shut down on the 2nd January 2025.

Banksia is having its final verification performed, and will be brought back into service with only one of the libraries.
Posted Dec 03, 2024 - 15:13 AWST
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Dec 03, 2024 - 05:00 AWST
Scheduled
Maintenance will be carried out on Pawsey systems on Tuesday the 3rd December to apply required patches and updates that improve the systems' stability, security, and performance. This maintenance window will also be used to undertake other tasks that require downtime.

The day before maintenance, the firewall will be updated to patch a security issue. Due to its high-availability design, this update should be transparent to researchers.

Planned work for this window includes:
• HPE will be shutting down the Slingshot fabric to integrate four additional switches into the fabric.
• HPE will be updating the Cabinet Controllers, Node Controllers and BIOS on CPU and GPU nodes to the latest supported version.
• Banksia will have a ScoutAM upgrade.
• SpectraLogic drive firmware upgrade.
• SpectraLogic Library Control Module (LCM) replacement.
• Patching of visualisation services will be undertaken.
• Patching of core Pawsey services will be undertaken.

When Setonix is returned to service:
• the gpu-dev partition will contain 10 nodes with a limit of 1 running job and 4 queued jobs per user (and 2 nodes per job).
• the slurm module will no longer be loaded by default, as all the environment variables it set are already in the pawsey module.
• the slurm controller configuration will be tweaked to weight the size of the job (Pawsey will monitor this change and determine the final weightings in the new year).

We expect to be able to bring all services back by the end of the day. If you have any questions, please contact help@pawsey.org.au.
Posted Nov 25, 2024 - 15:51 AWST
This scheduled maintenance affected: Garrawarla (Garrawarla workq partition, Garrawarla gpuq partition, Garrawarla asvoq partition, Garrawarla copyq partition, Garrawarla login node), Lustre filesystems (/scratch filesystem (new), /software filesystem), Setonix (Login nodes, Data-mover nodes, Slurm scheduler, Setonix work partition, Setonix debug partition, Setonix long partition, Setonix copy partition, Setonix askaprt partition, Setonix highmem partition, Setonix gpu partition, Setonix gpu high mem partition, Setonix gpu debug partition), Visualisation Services (Remote Vis, Vis scheduler, Setonix vis nodes, Nebula vis nodes, Visualisation Lab, Reservation, CARTA - Stable, CARTA - Test, Pawsey Remote VR), and Storage Systems (Banksia, Data Portal Systems, MWA Nodes, CASDA Nodes, MWA ASVO).