Pawsey Supercomputing Research Centre
Update - We are working with MWA and CASDA to investigate the cause of the slow read performance.

There was originally some concern that the MWA ingest was impacting the CASDA read performance; however, MWA halved its ingest workload this morning and this produced no noticeable change in the read behaviour. The reverse was also true: MWA increased ingest from half to full yesterday morning, and the read misbehaviour did not appear until around 1 pm.

At this point in the investigation, it appears that we are hitting limits in overall Acacia throughput (across both clusters) due to increased usage. Although they are separate clusters, they share some infrastructure. We are preparing changes to address this, but they cannot be implemented this week.
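
If you would like to quantify what you are seeing on your own workloads, a minimal sketch for timing a single-stream read through Acacia's S3-compatible API is included below. The endpoint, bucket and object key are placeholders, and boto3 is assumed to be available with credentials already configured; treat it as a rough measurement aid only.

    import time
    import boto3  # assumed to be installed, with S3 credentials already configured

    # Placeholder values -- substitute your own Acacia endpoint, bucket and object key.
    ENDPOINT = "https://<acacia-endpoint>"
    BUCKET = "<bucket-name>"
    KEY = "path/to/large-object.bin"

    s3 = boto3.client("s3", endpoint_url=ENDPOINT)

    start = time.time()
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"]
    nbytes = 0
    for chunk in iter(lambda: body.read(8 * 1024 * 1024), b""):  # read in 8 MiB chunks
        nbytes += len(chunk)
    elapsed = time.time() - start
    print(f"Read {nbytes / 1e6:.1f} MB in {elapsed:.1f} s "
          f"({nbytes / 1e6 / elapsed:.1f} MB/s)")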

Mar 01, 2024 - 12:41 AWST
Investigating - We have received reports of very slow read workloads on the Ingest cluster, and are investigating.
Feb 29, 2024 - 17:23 AWST
Update - We are continuing to monitor for any further issues.
Feb 21, 2024 - 09:18 AWST
Update - setonix-03 has been resurrected and placed back into the round-robin DNS.
Jan 03, 2024 - 12:08 AWST
Update - HPE have recommended that /software and /scratch be mounted using localflock rather than flock. We have implemented this change across Setonix and Garrawarla, and our internal testing has passed.

We will monitor the systems, but if you run into any issues with file locking, please reach out to help@pawsey.org.au.
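
For context, localflock means that flock() calls succeed but the locks are only coordinated among processes on the same node, whereas flock provides cluster-wide locking. If you want to confirm that advisory locking itself is working, a minimal Python sketch (with a placeholder path) is:

    import fcntl

    # Placeholder path -- use a file inside your own area on /scratch or /software.
    path = "/scratch/<project>/<username>/.flock_test"

    with open(path, "w") as fh:
        try:
            # Exclusive, non-blocking advisory lock. With "localflock" this only
            # coordinates processes on the same node, not across the cluster.
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
            print("lock acquired")
            fcntl.flock(fh, fcntl.LOCK_UN)
        except OSError as exc:
            print("lock failed:", exc)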

Please note that setonix-03 is currently unavailable due to an internal configuration issue, and we are removing it from the round-robin DNS.

Jan 02, 2024 - 13:41 AWST
Update - Once again the same two servers have powered off. A reservation has been placed across Setonix. We are waiting to hear back from HPE R&D.
Dec 22, 2023 - 16:09 AWST
Update - The Vendor has restarted the two Lustre servers again this afternoon.
Dec 22, 2023 - 14:58 AWST
Update - The Vendor has restarted the two servers this morning and the setonix-01 login node is responsive again. The Slurm partitions remain drained until health checks on the compute nodes have completed.
Dec 22, 2023 - 06:58 AWST
Update - Once again the same two servers have powered off overnight. This has been escalated to the Vendor.
Dec 22, 2023 - 05:26 AWST
Monitoring - HPE have restored the two Lustre metadata servers. Pawsey has turned off quota enforcement as a precaution and has removed the reservation.

We will observe the system over the weekend, but are unlikely to get a root cause analysis from HPE until Tuesday.

Dec 17, 2023 - 09:29 AWST
Identified - A reservation has been put in place across ALL nodes to prevent new jobs from starting until /scratch is restored.
Dec 16, 2023 - 16:20 AWST
Investigating - Both nodes that are capable of housing the Lustre Management Service (MGS) for /scratch are powered off. The net result is that /scratch is unavailable to users. A case has been logged with our vendor; however, it is unlikely to be resolved over the weekend.
Dec 16, 2023 - 15:57 AWST
Setonix Degraded Performance
90-day uptime: 90.51 %
Login nodes Operational
Data-mover nodes Operational
Slurm scheduler Operational
Setonix work partition Operational
Setonix debug partition Operational
Setonix long partition Degraded Performance
Setonix copy partition Operational
Setonix askaprt partition Operational
Setonix highmem partition Operational
Setonix gpu partition Operational
90-day uptime: 90.51 %
Setonix gpu high mem partition Operational
90-day uptime: 90.51 %
Setonix gpu debug partition Operational
90-day uptime: 90.51 %
Lustre filesystems Operational
90-day uptime: 93.6 %
/scratch filesystem (new) Operational
90-day uptime: 89.63 %
/software filesystem Operational
90-day uptime: 90.51 %
/askapbuffer filesystem Operational
90-day uptime: 97.14 %
/askapingest filesystem Operational
90-day uptime: 97.14 %
Storage Systems Degraded Performance
90-day uptime: 94.4 %
Acacia - Projects Operational
Banksia Operational
Data Portal Systems Operational
MWA Nodes Operational
CASDA Nodes Operational
Acacia - Ingest Degraded Performance
MWA ASVO Operational
90-day uptime: 94.4 %
ASKAP Operational
ASKAP ingest nodes Operational
ASKAP service nodes Operational
Garrawarla Operational
Garrawarla workq partition Operational
Garrawarla gpuq partition Operational
Garrawarla asvoq partition Operational
Garrawarla copyq partition Operational
Garrawarla login node Operational
Slurm Controller (Garrawarla) Operational
Nimbus Operational
Ceph storage Operational
Nimbus instances Operational
Nimbus dashboard Operational
Nimbus APIs Operational
Central Services Operational
90-day uptime: 95.3 %
Authentication and Authorization Operational
Service Desk Operational
License Server Operational
Application Portal Operational
Origin Operational
/home filesystem Operational
/pawsey filesystem Operational
Central Slurm Database Operational
90-day uptime: 97.14 %
Nebula Operational
90-day uptime: 91.64 %
Documentation Operational
90-day uptime: 97.14 %
Visualisation Services Operational
Remote Vis Operational
Nebula Operational
Visualisation Lab Operational
The Australian Biocommons Operational
Fgenesh++ Operational
Scheduled Maintenance
Pawsey Scheduled Maintenance (March) Mar 5, 2024 07:00 - Mar 7, 2024 19:00 AWST
Maintenance will be carried out on Setonix and Garrawarla on Tuesday the 5th of March to allow HPE to update the firmware of the /scratch filesystem, restoring the previous file-locking configuration. HPE estimates the work will take two days, so the expected return to service is sometime on Thursday the 7th of March.

Please note that after this work, quota enforcement will be re-enabled (limiting users to 2 million files on /scratch), cluster-wide file locking will be re-enabled (flock), and the mc client will be removed from Setonix.
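
If you would like a rough idea of your current usage before the quota is re-enabled, the sketch below walks a directory tree and counts its entries. The path is a placeholder, and counting directories alongside files is an assumption about how the quota is accounted; Lustre's own quota tools will give the authoritative figure.

    import os

    # Placeholder path -- point this at your own directory under /scratch.
    root = "/scratch/<project>/<username>"

    count = 0
    for _, dirs, files in os.walk(root):
        count += len(dirs) + len(files)  # count both files and directories

    print(f"{count} entries under {root} (quota limit: 2 million files)")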

We will also be updating the GPU driver on all GPU nodes to the version that comes with ROCm 5.5.3. This will allow a new version of the PyTorch container to be installed on Setonix.
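
Once the updated driver and container are available, a quick sanity check that a PyTorch build can see the AMD GPUs through ROCm might look like the sketch below; run it inside the PyTorch container, and note that the exact version strings reported will depend on the image that is installed.

    import torch

    print("PyTorch version:", torch.__version__)
    print("HIP/ROCm build:", torch.version.hip)       # None on CPU-only or CUDA builds
    print("GPU visible:", torch.cuda.is_available())  # ROCm devices are exposed via the cuda API
    if torch.cuda.is_available():
        print("Device 0:", torch.cuda.get_device_name(0))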

We appreciate your support; if you have any questions, please e-mail help@pawsey.org.au.

Posted on Feb 27, 2024 - 09:51 AWST
Past Incidents
Mar 2, 2024

No incidents reported today.

Mar 1, 2024

Unresolved incident: Acacia Ingest read workloads very slow.

Feb 29, 2024
Feb 28, 2024

No incidents reported.

Feb 27, 2024

No incidents reported.

Feb 26, 2024

No incidents reported.

Feb 25, 2024

No incidents reported.

Feb 24, 2024

No incidents reported.

Feb 23, 2024

No incidents reported.

Feb 22, 2024

No incidents reported.

Feb 21, 2024

Unresolved incident: /scratch issues.

Feb 20, 2024
Completed - The scheduled maintenance has been completed.
Feb 20, 16:00 AWST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Feb 20, 09:00 AWST
Scheduled - Dear Pawsey Researchers,

Both the Pawsey Helpdesk system and the Pawsey documentation portal are hosted on software developed by Atlassian, called Jira and Confluence respectively. Due to changes in Atlassian's offerings, we will move to their cloud-based solution. In the coming weeks you may notice some changes to the visuals and to how you access these services. We will work to minimise the impact on users and keep you informed of anything that might affect how you use these systems.

What does this mean for me?

Date of Migration:
20th February 2024.

Documentation:
You may see some cosmetic changes but the documentation portal functionality will have minimal modifications. It will be available at https://pawsey.atlassian.net/wiki/spaces/US/overview

Helpdesk:
If you contact the Pawsey helpdesk by emailing help@pawsey.org.au, then nothing will change. However, if you use the web portal to lodge and track your helpdesk tickets, you will see both some cosmetic changes and some changes to how you access it:

1. Instead of using your Pawsey logon, you must log on using an Atlassian account. If your institution also uses Atlassian products and has enabled single sign-on, you may be able to log in to the Pawsey helpdesk using your institutional details.
2. Changing your Helpdesk password will now be done via the Atlassian “forgotten password” link, which may redirect you to your institution if you are using your institutional single sign-on.
3. If you log in using the same email address that you normally use to contact us, you will be able to see your existing tickets. If you use a different email address, please contact us and we can help link your tickets to you.
4. If you do not have an Atlassian account or institutional single sign-on, you will need to create an Atlassian account {place a link here}. You may find it easiest to use the same email address you normally use to contact the Pawsey helpdesk.

We will keep you updated with the progress on the day of migration.

If you have any problems please contact us at help@pawsey.org.au (which delivers to a mailbox that can be read independently of the Jira helpdesk system).

Feb 12, 15:32 AWST
Feb 19, 2024

No incidents reported.

Feb 18, 2024

No incidents reported.

Feb 17, 2024

No incidents reported.