Pawsey Supercomputing Research Centre
Update - Data Portal and pshell users may currently experience unusually long waits when downloading files from banksia. These downloads will sometimes fail with errors (e.g. timeout or content not available); however, the banksia scheduler will still eventually bring the files back online, and the download will succeed if retried a day or so after the initial attempt. We are investigating this with the vendor as a potential scheduler issue.
Jan 23, 2025 - 14:47 AWST
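
While this is being investigated, scripted transfers can work around the staging delay with a retry loop that waits long enough for banksia to bring files back online between attempts. The sketch below is hypothetical and not part of any Pawsey tooling: fetch, max_attempts and wait_hours are placeholder names, and fetch stands in for whatever pshell or Data Portal download call you use.

    import time

    def download_with_retries(fetch, max_attempts=4, wait_hours=12):
        """Retry a download callable, leaving time between attempts for
        the banksia scheduler to stage the requested files back online."""
        for attempt in range(1, max_attempts + 1):
            try:
                return fetch()  # placeholder for the actual download call
            except Exception as err:  # timeout, content not available, etc.
                if attempt == max_attempts:
                    raise
                print(f"Attempt {attempt} failed ({err}); retrying in {wait_hours} h")
                time.sleep(wait_hours * 3600)
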
Update - We have resolved most of the instability between Mediaflux and the tape storage system. Some minor issues remain which we are continuing to monitor and work on.
Dec 18, 2024 - 13:39 AWST
Update - Work continues with two vendors to resolve this. We will be applying and testing some recommended changes today to see whether the issue can be completely resolved.
Dec 13, 2024 - 07:57 AWST
Update - A setting was adjusted on the Mediaflux server, which has improved the success rate for retrieving files; however, some issues remain, and we are continuing to investigate them with the vendor.
Dec 12, 2024 - 08:02 AWST
Investigating - We are aware of an issue that may be affecting some users attempting file transfers with Mediaflux (Data Portal and pshell). It is currently being investigated.
Dec 11, 2024 - 11:44 AWST
Setonix Operational
Login nodes Operational
Data-mover nodes Operational
Slurm scheduler Operational
Setonix work partition Operational
Setonix debug partition Operational
Setonix long partition Operational
Setonix copy partition Operational
Setonix askaprt partition Operational
Setonix highmem partition Operational
Setonix gpu partition Operational
Setonix gpu high mem partition Operational
Setonix gpu debug partition Operational
Lustre filesystems Operational
/scratch filesystem (new) Operational
/software filesystem Operational
/askapbuffer filesystem Operational
/askapingest filesystem Operational
Storage Systems Operational
Acacia - Projects Operational
Banksia Operational
Data Portal Systems Operational
MWA Nodes Operational
CASDA Nodes Operational
Acacia - Ingest Operational
MWA ASVO Operational
ASKAP Operational
ASKAP ingest nodes Operational
ASKAP service nodes Operational
Nimbus Operational
Ceph storage Operational
Nimbus instances Operational
Nimbus dashboard Operational
Nimbus APIs Operational
Central Services Operational
Authentication and Authorization Operational
Service Desk Operational
License Server Operational
Application Portal Operational
Origin Operational
/home filesystem Operational
/pawsey filesystem Operational
Central Slurm Database Operational
Documentation Operational
Visualisation Services Operational
Remote Vis Operational
Vis scheduler Operational
Setonix vis nodes Operational
Nebula vis nodes Operational
Visualisation Lab Operational
Reservation Operational
CARTA - Stable Operational
CARTA - Test Operational
Pawsey Remote VR Operational
The Australian Biocommons Operational
Fgenesh++ Operational
Status key: Operational | Degraded Performance | Partial Outage | Major Outage | Maintenance

Scheduled Maintenance

Pawsey Scheduled Maintenance (April) Apr 7, 2025 15:00 - Apr 8, 2025 21:00 AWST

April maintenance has been pushed back a week to support the staged upgrade of Setonix to the next "Extended Support Release". The first step is to upgrade the Slingshot fabric manager to the latest supported release, which will provide HPE advanced monitoring capabilities and the ability to replace faulty Slingshot switches while the fabric is in production. The version of Slingshot Host Software running on Setonix nodes will not be upgraded at this stage; it will be upgraded as part of the Cray Operating System upgrade scheduled for later in the year.

As the HPE engineers performing the upgrade are in a different time zone from Perth, maintenance is currently scheduled to start at 5 PM on 7 April 2025. Pawsey is working with HPE to lock in timings and minimise the amount of downtime for Setonix.

To minimise the impact on researchers, all other Pawsey maintenance will be conducted on Tuesday, 8 April 2025. Further details will be provided next week.

If you have any questions, please reach out to help@pawsey.org.au.

Posted on Mar 25, 2025 - 09:15 AWST
Live metrics (values load dynamically on the status page): Allocated Cores (Setonix); Allocated Nodes (Setonix work partition); Allocated Nodes (Setonix askaprt partition); Active Instances (Nimbus); Active Cores (Nimbus)
Mar 31, 2025

No incidents reported today.

Mar 30, 2025

No incidents reported.

Mar 29, 2025

No incidents reported.

Mar 28, 2025

No incidents reported.

Mar 27, 2025
Resolved - Writes have been re-enabled for the four OSTs that were previously 100% full. /scratch is now back to its normal configuration.
Mar 27, 09:05 AWST
Monitoring - Bulk deletion of older files from /scratch (see the policy at https://pawsey.atlassian.net/wiki/display/US//Filesystem+Policies) has increased the free capacity of the flash pool to over 100 TB. Individual OSTs within this pool are still between 90% and 95% full, and researchers are requested to delete any unneeded files from /scratch to prevent jobs failing when they cannot write output.
Mar 27, 05:56 AWST
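
As a sketch of how to find cleanup candidates, Lustre's lfs find utility can list files that have not been modified recently. This is illustrative only: the directory and the 30-day threshold are placeholders, and the linked Filesystem Policies page defines the actual purge rules.

    import subprocess

    # Placeholder path: substitute your own project directory on /scratch.
    scratch_dir = "/scratch/myproject/myuser"

    # List regular files untouched for more than 30 days (placeholder threshold).
    result = subprocess.run(
        ["lfs", "find", scratch_dir, "-type", "f", "-mtime", "+30"],
        capture_output=True, text=True, check=True,
    )
    old_files = result.stdout.splitlines()
    print(f"{len(old_files)} files are candidates for deletion")
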
Investigating - The /scratch filesystem used by Setonix is composed of two pools: a high-performance flash component and a much larger but slower disk component. The flash pool is presently 98% full, with some individual OSTs having only a few tens of GB free. While there is still plenty of capacity (over 3 PB) free on the disk pool, users may see jobs fail with write errors, especially if they are overriding the default striping applied by Pawsey.
Mar 26, 04:35 AWST
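
To see the condition this update describes, Lustre's standard client tools report per-OST usage and a file's stripe layout. A minimal sketch, assuming the lfs client is on your PATH; "my_output_dir" is a placeholder.

    import subprocess

    # Per-OST usage for /scratch: a nearly full OST can cause write errors
    # even when the filesystem as a whole reports plenty of free space.
    print(subprocess.run(["lfs", "df", "-h", "/scratch"],
                         capture_output=True, text=True).stdout)

    # Stripe layout of an output directory: overriding the default striping
    # (e.g. pinning files to the flash pool) increases the chance of landing
    # on a full OST.
    print(subprocess.run(["lfs", "getstripe", "my_output_dir"],
                         capture_output=True, text=True).stdout)
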
Mar 26, 2025
Mar 25, 2025

No incidents reported.

Mar 24, 2025

No incidents reported.

Mar 23, 2025

No incidents reported.

Mar 22, 2025

No incidents reported.

Mar 21, 2025

No incidents reported.

Mar 20, 2025

No incidents reported.

Mar 19, 2025

No incidents reported.

Mar 18, 2025

No incidents reported.

Mar 17, 2025
Resolved - This incident has been resolved.
Mar 17, 10:35 AWST
Investigating - We are currently investigating this issue; it has been logged with the vendor as a P1.
Mar 17, 10:07 AWST