Pawsey Supercomputing Centre
All Systems Operational
Magnus Operational
Magnus Compute nodes Operational
Magnus login nodes Operational
Slurm (Magnus) Operational
Galaxy Operational
Galaxy Compute nodes Operational
Galaxy login nodes Operational
Slurm (Galaxy) Operational
Topaz Operational
Slurm Controller (topaz) Operational
GPU partition Operational
Zeus Operational
Zeus login node Operational
Zeus Compute nodes Operational
Galaxy ingest nodes Operational
Data Mover nodes (CopyQ) Operational
Slurm (Zeus) Operational
Central Slurm Database Operational
Lustre filesystems Operational (98.87 % uptime over the past 90 days)
/scratch filesystem Operational (98.86 % uptime over the past 90 days)
/group filesystem Operational (98.86 % uptime over the past 90 days)
/astro filesystem Operational (98.87 % uptime over the past 90 days)
/askapbuffer filesystem Operational (98.87 % uptime over the past 90 days)
Nimbus Operational
Ceph storage Operational
Nimbus instances Operational
Nimbus dashboard Operational
Storage Systems Operational
Data Portal Systems Operational
Hierarchical Storage Management Systems Operational
MWA Nodes Operational
CASDA Nodes Operational
Central Services Operational
Authentication and Authorization Operational
Service Desk Operational
License Server Operational
Scheduled Maintenance
Extended monthly maintenance Sep 1, 07:00 - Sep 2, 19:00 AWST
In preparation for installing new equipment in the Pawsey Centre as part of the capital refresh, we need to perform work on the electrical supply to the computer rooms. This requires us to shut down a substantial amount of equipment, which is why this maintenance window is longer than usual.
Posted on Aug 3, 11:48 AWST
Past Incidents
Aug 3, 2020

No incidents reported today.

Aug 2, 2020

No incidents reported.

Aug 1, 2020

No incidents reported.

Jul 31, 2020

No incidents reported.

Jul 30, 2020

No incidents reported.

Jul 29, 2020

No incidents reported.

Jul 28, 2020

No incidents reported.

Jul 27, 2020

No incidents reported.

Jul 26, 2020

No incidents reported.

Jul 25, 2020

No incidents reported.

Jul 24, 2020
Postmortem - Published
Jul 25, 18:43 AWST
Resolved - We were unable to restart the Slurm controller (slurmctld) for Zeus without clearing the Slurm saved state. This means that any jobs that were running at around 10:15 this morning will have been lost, together with any queued jobs.
Users are asked to resubmit any lost jobs to the scheduler.
Jul 24, 10:28 AWST
Investigating - We are currently investigating an issue with the Zeus Slurm controller, which has been logging errors since 03:30 this morning.
Jul 24, 06:59 AWST
Jul 23, 2020

No incidents reported.

Jul 22, 2020

No incidents reported.

Jul 21, 2020

No incidents reported.

Jul 20, 2020

No incidents reported.