Get webhook notifications whenever Pawsey Supercomputing Centre creates an incident, updates an incident, resolves an incident, or changes a component status.
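For subscribers who want to automate on these notifications, below is a minimal sketch of a webhook receiver. The port, endpoint, and payload fields ("incident", "component") are assumptions based on typical status-page webhook payloads, not a documented Pawsey schema.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class StatusWebhookHandler(BaseHTTPRequestHandler):
    """Accepts status-page webhook POSTs and logs incident/component updates."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")

        # Payload keys are assumed from typical status-page webhooks.
        incident = payload.get("incident")
        component = payload.get("component")
        if incident:
            print(f"Incident update: {incident.get('name')} -> {incident.get('status')}")
        if component:
            print(f"Component update: {component.get('name')} -> {component.get('status')}")

        self.send_response(200)  # acknowledge promptly so the sender does not retry
        self.end_headers()

if __name__ == "__main__":
    # Hypothetical local endpoint; its public URL would be registered with the status page.
    HTTPServer(("", 8080), StatusWebhookHandler).serve_forever()
```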
Monitoring - We are still waiting for replacement parts for this issue. Our vendor's supply chain has been affected by COVID-19, but we expect the parts soon.
Dec 7, 15:31 AWST
Investigating - Automated alerting has just indicated that there may be a high-speed network issue on Galaxy. It looks like one of the blades (containing nodes nid00500 to nid00503) went offline around 6pm Perth time.
Staff will investigate in the morning, but jobs running at that time may have been impacted.
Oct 21, 18:10 AWST
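Users wanting to check whether their jobs overlapped this outage can query Slurm's accounting database. A minimal sketch, assuming sacct is available on the login nodes; the date is illustrative, since the update above gives only a day and time:

```python
import subprocess

# Query Slurm accounting for jobs that touched the affected blade around
# 6pm Perth time. Substitute the actual incident date.
result = subprocess.run(
    [
        "sacct",
        "--nodelist=nid00[500-503]",
        "--starttime=2021-10-21T17:00:00",
        "--endtime=2021-10-21T19:00:00",
        "--format=JobID,JobName,State,NodeList,Start,End",
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```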
Magnus: Operational
  Magnus compute nodes: Operational
  Magnus login nodes: Operational
  Slurm Controller (Magnus): Operational
Galaxy: Operational
  Galaxy compute nodes: Operational
  Galaxy login nodes: Operational
  Slurm Controller (Galaxy): Operational
Topaz: Operational
  Slurm Controller (Topaz): Operational
  GPU partition: Operational
Garrawarla: Operational
  Garrawarla compute nodes: Operational
  Slurm Controller (Garrawarla): Operational
Zeus: Operational
  Zeus login node: Operational
  Zeus compute nodes: Operational
  Galaxy ingest nodes: Operational
  Data Mover nodes (CopyQ): Operational
  Slurm Controller (Zeus): Operational
Central Slurm Database: Operational
Lustre filesystems: Operational (99.89% uptime over the past 90 days; converted to downtime hours in the sketch after this list)
  /scratch filesystem: Operational (100.0% uptime over the past 90 days)
  /group filesystem: Operational (100.0% uptime over the past 90 days)
  /astro filesystem: Operational (99.58% uptime over the past 90 days)
  /askapbuffer filesystem: Operational (100.0% uptime over the past 90 days)
Nimbus: Operational
  Ceph storage: Operational
  Nimbus instances: Operational
  Nimbus dashboard: Operational
Storage Systems: Operational
  Data Portal Systems: Operational
  Hierarchical Storage Management Systems: Operational
  MWA Nodes: Operational
  CASDA Nodes: Operational
Central Services: Operational
  Authentication and Authorization: Operational
  Service Desk: Operational
  License Server: Operational
Resolved - One compute node that is part of the Nimbus infrastructure rebooted unexpectedly. Because of the way the hypervisor rebooted, the root and data volumes of the instances on that node still had locks on them, so the instances could not start up again properly.
We unlocked those volumes and restarted the instances. We apologise for the inconvenience.
Jan 21, 14:00 AWST
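Nimbus is an OpenStack-based research cloud, so the "unlock and restart" recovery described above likely resembled the sketch below in outline. This is an assumption: the openstack CLI subcommands are standard, but the IDs are hypothetical placeholders and this is not necessarily the exact procedure Pawsey ran.

```python
import subprocess

def openstack(*args):
    """Thin wrapper over the standard `openstack` CLI (assumes admin credentials)."""
    return subprocess.run(["openstack", *args],
                          capture_output=True, text=True, check=True).stdout

# Hypothetical placeholders for an affected instance and one of its volumes.
instance = "INSTANCE_UUID"  # instance that failed to boot after the hypervisor reboot
volume = "VOLUME_UUID"      # root or data volume left locked by a stale attachment

# Force the volume's state back to a usable value (an admin-only operation),
# then start the instance again: the "unlock and restart" step from the update.
openstack("volume", "set", "--state", "in-use", volume)
openstack("server", "start", instance)
```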
Resolved - This incident has been resolved.
Jan 12, 05:27 AWST
Identified - We have identified at least one issue with the OSS connection to the storage array, and have flagged a second (unrelated) configuration issue to the vendor, who will investigate.
Dec 18, 12:51 AWST
Investigating - One of the servers that provide the /astro filesystem has been under high load and logging errors. The type of errors being logged suggests an underlying issue with the storage array connected to it. Staff will investigate further when onsite and escalate to the vendor.
Dec 17, 08:30 AWST
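Server-side Lustre faults like those described above normally land in the kernel log with a "LustreError" prefix. A minimal triage sketch, assuming shell access to the affected OSS:

```python
import subprocess

# Lustre reports server-side problems to the kernel ring buffer with a
# "LustreError" prefix; scanning for it is a common first triage step.
kernel_log = subprocess.run(["dmesg"], capture_output=True,
                            text=True, check=True).stdout

for line in kernel_log.splitlines():
    if "LustreError" in line:
        print(line)
```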
Resolved - No further issues seen with the two MDS servers throughout the day.
Jan 12, 05:26 AWST
Monitoring - Both MDSs appear to be performing normally following a reboot.
Jan 11, 07:01 AWST
Identified - The pair of servers that make up the metadata service for /astro (running the MGS and both MDTs) isn't handling failover correctly. One server rebooted earlier this morning, but the disk LUNs aren't being handed off correctly to the partner. This will require a reboot of one, possibly both, servers. During this time, access to /astro will be impacted.
Jan 11, 04:14 AWST
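For context on the failover mechanics above: each Lustre target (the MGS and each MDT) should be mounted by exactly one server of the pair at a time, with the disk LUNs handed to the survivor when one node fails. A hedged sketch of how an operator might confirm which targets a server currently runs and whether recovery has finished, assuming the standard lctl tool:

```python
import subprocess

def lctl(*args):
    """Run the standard Lustre `lctl` utility on this server."""
    return subprocess.run(["lctl", *args], capture_output=True,
                          text=True, check=True).stdout

# List active Lustre devices; after a clean failover the surviving MDS
# should show the MGS and both MDTs here.
print(lctl("dl"))

# Recovery status of any local MDTs (clients reconnecting after the target
# moved); a status of COMPLETE indicates failover recovery has finished.
print(lctl("get_param", "mdt.*.recovery_status"))
```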