Pawsey Supercomputing Research Centre
Update - Monitoring and testing continue today. We hope to make copy two available within 24 hours.
May 03, 2024 - 11:53 AWST
Monitoring - Further logs have been gathered and sent to the library hardware company for review and verification, and Pawsey staff are monitoring the library. Pawsey staff will carry out further on-site testing tomorrow to confirm the library is ready to return to service as soon as possible.
May 01, 2024 - 18:20 AWST
Identified - Unisys have attended site, examined a number of robotics issues, and remedied them. The library is up and being tested. It will return to service once it passes a number of tests and remains fault-free under monitoring.
May 01, 2024 - 18:17 AWST
Update - This complex issue with Lib2 has been escalated to Spectralogic management to obtain onsite support, so that the issues with this library can be fully diagnosed and remedied with parts and updates, and the library brought back online for second-copy use. Service remains degraded and at risk.
Apr 26, 2024 - 11:19 AWST
Update - We are continuing to investigate this issue.
Apr 22, 2024 - 08:24 AWST
Update - We are working with Spectralogic Support to identify why the robotics on Lib 2 have errored.
Apr 22, 2024 - 08:24 AWST
Investigating - The Banksia service is in a degraded/"at risk" state because it is operating on only one tape library rather than the default two. The redundant second copy of files will be unavailable for staging or archiving until Library 2 is returned to service.
Apr 22, 2024 - 08:21 AWST
Monitoring - Just before Pawsey was due to return Setonix to service yesterday, HPE performed what should have been a minor hardware replacement on one of the Lustre servers. This caused one of the Lustre servers to crash, and the onsite HPE engineer spent over twenty-four hours resolving the issue.

HPE have advised Pawsey that the hardware issue is now resolved, and with this assurance Pawsey has removed login restrictions on Setonix and Garrawarla.

Pawsey will monitor both systems to observe the impact of HPE's advised workaround. If you have any issues, please email help@pawsey.org.au.

Mar 12, 2024 - 11:57 AWST
Identified - At 11:30 AM on Friday (8 March 2024), HPE provided Pawsey with a potential workaround for the instability issues experienced by the Scratch filesystem. Pawsey staff spent all of Friday afternoon rebuilding compute, login and data mover node images, and rebooted Setonix to apply the workaround consistently.

Testing performed over the weekend highlighted that the data mover node images need to be rebuilt.

The filesystem is being monitored and has so far stayed up.

Mar 11, 2024 - 08:58 AWST
Investigating - All four metadata servers are offline. HPE are investigating.
Mar 07, 2024 - 08:11 AWST
Setonix Operational (94.27% uptime over the past 90 days)
Login nodes Operational
Data-mover nodes Operational
Slurm scheduler Operational
Setonix work partition Operational
Setonix debug partition Operational
Setonix long partition Operational
Setonix copy partition Operational
Setonix askaprt partition Operational
Setonix highmem partition Operational
Setonix gpu partition Operational (94.27% uptime over the past 90 days)
Setonix gpu high mem partition Operational (94.27% uptime over the past 90 days)
Setonix gpu debug partition Operational (94.27% uptime over the past 90 days)
Lustre filesystems Operational (94.03% uptime over the past 90 days)
/scratch filesystem (new) Operational (76.12% uptime over the past 90 days)
/software filesystem Operational (100.0% uptime over the past 90 days)
/askapbuffer filesystem Operational (99.99% uptime over the past 90 days)
/askapingest filesystem Operational (100.0% uptime over the past 90 days)
Storage Systems Operational (100.0% uptime over the past 90 days)
Acacia - Projects Operational
Banksia Operational
Data Portal Systems Operational
MWA Nodes Operational
CASDA Nodes Operational
Acacia - Ingest Operational
MWA ASVO Operational (100.0% uptime over the past 90 days)
ASKAP Operational
ASKAP ingest nodes Operational
ASKAP service nodes Operational
Garrawarla Operational
Garrawarla workq partition Operational
Garrawarla gpuq partition Operational
Garrawarla asvoq partition Operational
Garrawarla copyq partition Operational
Garrawarla login node Operational
Slurm Controller (Garrawarla) Operational
Nimbus Operational
Ceph storage Operational
Nimbus instances Operational
Nimbus dashboard Operational
Nimbus APIs Operational
Central Services Operational (100.0% uptime over the past 90 days)
Authentication and Authorization Operational
Service Desk Operational
License Server Operational
Application Portal Operational
Origin Operational
/home filesystem Operational
/pawsey filesystem Operational
Central Slurm Database Operational (100.0% uptime over the past 90 days)
Documentation Operational (100.0% uptime over the past 90 days)
Visualisation Services Operational
Remote Vis Operational
Vis scheduler Operational
Setonix vis nodes Operational
Nebula vis nodes Operational
Visualisation Lab Operational
Reservation Operational
CARTA - Stable Operational
CARTA - Test Operational
Pawsey Remote VR Operational
The Australian Biocommons Operational
Fgenesh++ Operational
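
For a rough sense of scale, the 90-day uptime percentages listed above can be converted into approximate hours of downtime. The sketch below is illustrative only: the helper name downtime_hours is hypothetical, and it assumes each percentage is a simple fraction of wall-clock time over the 90-day window, which may not be exactly how the status page computes it (for example, partial outages may be weighted differently).

```python
def downtime_hours(uptime_percent: float, window_days: int = 90) -> float:
    """Approximate hours of downtime implied by an uptime percentage.

    Assumption: uptime is a plain fraction of wall-clock time in the window.
    """
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * window_days * 24


# Percentages taken from the component list above.
for name, pct in [("Setonix", 94.27), ("/scratch filesystem (new)", 76.12)]:
    hours = downtime_hours(pct)
    print(f"{name}: about {hours:.0f} h ({hours / 24:.1f} days) of downtime")
```

Under that assumption, 94.27% uptime corresponds to roughly 124 hours of downtime over 90 days, and 76.12% to roughly 516 hours.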
Scheduled Maintenance
Pawsey Scheduled Maintenance (June) Jun 4, 2024 08:00-20:00 AWST
Maintenance will be carried out on Pawsey systems on Tuesday, 4 June to apply required patches and updates that improve the systems' stability, security, and performance. This maintenance window will also be used for other tasks that require downtime.

Planned work for this window includes:
• Update the BIOS on the askapbuffer Lustre servers
• Update the controller firmware on the askapbuffer storage arrays
• Update the LNet Monitoring resource agents on askapbuffer
• Update the HCA firmware on the Setonix LNet routers
• Update the BIOS on the Garrawarla data mover nodes
• Rectify cabling on the DDN arrays for Banksia
• Patch core Pawsey services

During the upcoming June maintenance, the Cray Programming Environment (CPE) on Setonix will be updated to version 23.09. This includes updated MPI libraries and a newer Cray compiler version (16.0.1); the GCC compiler version is unchanged. In preparation, the Pawsey team has built the new software stack.

The new version of the software stack, 2024.05, will be the default and will sit alongside version 2023.08, which is currently available on Setonix. You can still use the older 2023.08 deployment by unloading the compiler module, swapping the pawseyenv module, and reloading the compiler module. More detail is available on our documentation page: June 2024 Software Update - Important Information.

Based on our testing, researchers should be able to use the existing local software installations without recompilation, but there may be exceptions. In such cases, feel free to reach out to the helpdesk for support.

Please note that the 2022.11 software stack is currently planned to be deprecated during the September 2024 maintenance, and the 2023.08 software stack in January 2025. We recommend that all researchers migrate to the 2024.05 software stack as soon as possible.

Highlights of the new software stack:
• Software has been installed using a newer Spack major release, v0.21.0.
• Several packages are now also compiled using the Cray compiler.
• Version updates of most packages in both the GNU and Cray programming environments.
• The cpe/23.09 module has been used to compile the 2024.05 stack. The cpe/23.03 module will also be available as a non-default environment.
• Several newer versions of ROCm up to 5.7.3 will be available.
• ROCm 5.7.3 is recommended and has been used to build GPU packages.

We will publish the changes in more detail in our technical newsletter. Please monitor the documentation page above for further details approaching the maintenance.

Please note that there won’t be any changes to the Cray Programming Environment (CPE) on the Setonix Remote Visualisation Nodes. We are aiming to update the CPE and the software stack on those nodes during the July maintenance.

Posted on May 28, 2024 - 09:03 AWST
Past Incidents
May 30, 2024

No incidents reported today.

May 29, 2024

No incidents reported.

May 28, 2024

No incidents reported.

May 27, 2024

No incidents reported.

May 26, 2024

No incidents reported.

May 25, 2024

No incidents reported.

May 24, 2024

No incidents reported.

May 23, 2024

No incidents reported.

May 22, 2024

No incidents reported.

May 21, 2024

No incidents reported.

May 20, 2024

No incidents reported.

May 19, 2024

No incidents reported.

May 18, 2024

No incidents reported.

May 17, 2024

No incidents reported.

May 16, 2024

No incidents reported.