Pawsey Supercomputing Centre
Update - We are still awaiting the replacement part. We expect it to arrive in time for the scheduled maintenance on Tuesday.
May 27, 08:45 AWST
Update - Today's update from the vendor: they have requested an escalation from the Intel Support team to expedite the RMA process, but we still do not have any parts on site or an ETA for their arrival.
May 14, 13:37 AWST
Update - We are still waiting for replacement parts to be shipped to site from our vendor.
May 12, 12:00 AWST
Update - We are in contact with the vendor requesting replacement parts. At this stage, we have not been able to secure an ETA from the vendor.
May 7, 09:39 AWST
Identified - We've identified the failed component and the Omni-Path fabric is stable. The Omni-Path-connected Lustre clients are in recovery mode, and recovery may take some time to complete.
May 6, 08:58 AWST
Update - We are investigating an issue with the Omni-Path switch. This affects certain Zeus partitions. Please see the incident log page for more information: https://support.pawsey.org.au/documentation/display/US/I-2020-05-05-SC
May 5, 14:41 AWST
Investigating - We are investigating an issue with the Omni-Path switch.
May 5, 14:05 AWST
Identified - The daily disk usage breakdown by person/group for the /askapbuffer filesystem has not been mailed out to the astronomy group leaders. This appears to be caused by a particularly long database query that has hung and is locking the report. Staff are working on a workaround.
May 27, 07:35 AWST
Magnus Operational
Magnus Compute nodes Operational
Magnus login nodes Operational
Slurm (Magnus) Operational
Galaxy Operational
Galaxy Compute nodes Operational
Galaxy login nodes Operational
Slurm (Galaxy) Operational
Topaz Operational
Slurm Controller (topaz) Operational
GPU partition Operational
Zeus Operational
Zeus login node Operational
Zeus Compute nodes Operational
Galaxy ingest nodes Operational
Data Mover nodes (CopyQ) Operational
Slurm (Zeus) Operational
Lustre filesystems Operational (99.99 % uptime over the past 90 days)
/scratch filesystem Operational (100.0 % uptime over the past 90 days)
/group filesystem Operational (99.99 % uptime over the past 90 days)
/astro filesystem Operational (100.0 % uptime over the past 90 days)
/askapbuffer filesystem Operational (99.98 % uptime over the past 90 days)
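For reference, a displayed uptime percentage can be converted into an approximate amount of downtime. The following is a minimal sketch in Python, assuming the 90-day window shown on this page; the function name `downtime_minutes` is illustrative only, and because the displayed percentages are rounded, the results are estimates rather than measured outage durations.

```python
def downtime_minutes(uptime_percent, window_days=90):
    """Approximate downtime in minutes implied by an uptime percentage.

    Assumes the percentage covers a window of `window_days` full days.
    """
    window_minutes = window_days * 24 * 60  # 90 days = 129,600 minutes
    return window_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Percentages as displayed for the Lustre filesystems on this page.
    for label, pct in [("/scratch", 100.0), ("/group", 99.99), ("/askapbuffer", 99.98)]:
        print(f"{label}: about {downtime_minutes(pct):.0f} minutes of downtime")
```

Under these assumptions, 99.99 % over 90 days corresponds to roughly 13 minutes of downtime, and 99.98 % to roughly 26 minutes.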
Nimbus Operational
Ceph storage Operational
Nimbus instances Operational
Nimbus dashboard Operational
Storage Systems Operational
Data Portal Systems Operational
Hierarchical Storage Management Systems Operational
MWA Nodes Operational
CASDA Nodes Operational
Central Services Operational
Authentication and Authorization Operational
Service Desk Operational
License Server Operational
Central Slurm Database Operational
Past Incidents
Jun 4, 2020

No incidents reported today.

Jun 3, 2020

No incidents reported.

Jun 2, 2020
Completed - Dear Pawsey Researchers,

Maintenance on the Pawsey compute systems has been completed and the following services are available for use:

• Magnus
• Galaxy
• Zeus
• Topaz

If you encounter any problems, or have any questions, please email help@pawsey.org.au.
Jun 2, 18:12 AWST
Update - Maintenance on the Pawsey storage systems has been completed and the following services are available for use:

• Pawsey Data Portal
• Mediaflux
• DMF
• NGAS
• CASDA
• RDS Storage
Jun 2, 15:47 AWST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jun 2, 09:15 AWST
Scheduled - Pawsey technicians will be using the June 2020 maintenance window to undertake preventative work to improve system performance and reliability.
Among other things, we expect to update the operating systems on the CXFS cluster systems, improve QoS on Galaxy, update firmware on Zeus nodes, restart filesystems, replace cables and rectifiers on the Cray systems, and recompile LAMMPS in some environments.
Jun 2, 09:11 AWST
Jun 1, 2020

No incidents reported.

May 31, 2020

No incidents reported.

May 30, 2020

No incidents reported.

May 29, 2020
Completed - The check on mds02 completed at around 12:45 AM.
May 29, 06:49 AWST
Update - The check on askapfs-MDT0000 has completed without any errors. The check on MDT0001 is still in progress (it was started an hour later).
May 28, 21:20 AWST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
May 28, 19:01 AWST
Scheduled - Pawsey staff plan to run lfsck checks on the metadata servers for the /askapbuffer filesystem. Although this will be done online, there may be some performance degradation during this time.
May 28, 18:35 AWST
May 28, 2020
May 27, 2020
Resolved - One server of the HA pair providing the /askapbuffer metadata service rebooted overnight. The HA handover to its partner does not appear to have worked correctly, leaving the filesystem unresponsive.

Staff have restarted services and are investigating the root cause.

askap-fs1-mds01-fence (stonith:fence_ipmilan): Started askap-fs1-mds02.pawsey.org.au
askap-fs1-mds02-fence (stonith:fence_ipmilan): Stopped
askapfs1-MGS (ocf::lustre:Lustre): Stopped
askapfs1-MDT0000 (ocf::lustre:Lustre): Stopped
askapfs1-MDT0001 (ocf::lustre:Lustre): Stopped

Failed Actions:
* askapfs1-MDT0000_start_0 on askap-fs1-mds02.pawsey.org.au 'unknown error' (1): call=26, status=complete, exitreason='',
last-rc-change='Wed May 27 23:24:44 2020', queued=0ms, exec=21222ms
* askapfs1-MGS_start_0 on askap-fs1-mds02.pawsey.org.au 'unknown error' (1): call=24, status=complete, exitreason='',
last-rc-change='Wed May 27 23:24:44 2020', queued=0ms, exec=21363ms
* askapfs1-MDT0001_start_0 on askap-fs1-mds02.pawsey.org.au 'unknown error' (1): call=25, status=complete, exitreason='',
last-rc-change='Wed May 27 23:24:44 2020', queued=0ms, exec=21197ms
May 27, 22:00 AWST
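
The cluster status excerpt in the incident above lists each resource with its state. As an illustration only, the sketch below shows one way such an excerpt could be scanned for stopped resources in Python. It assumes resource lines follow the `name (agent): State [node]` layout seen in the excerpt; the function names and the regular expression are hypothetical and are not part of any Pawsey or Pacemaker tooling.

```python
import re

# Matches resource lines in the layout seen in the excerpt, for example:
#   askapfs1-MGS (ocf::lustre:Lustre): Stopped
#   askap-fs1-mds01-fence (stonith:fence_ipmilan): Started askap-fs1-mds02.pawsey.org.au
# Lines from the "Failed Actions" block do not match and are skipped.
RESOURCE_LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+\((?P<agent>[^)]+)\):\s+(?P<state>\w+)(?:\s+(?P<node>\S+))?\s*$"
)


def parse_resources(status_text):
    """Return one dict per resource line (keys: name, agent, state, node)."""
    resources = []
    for line in status_text.splitlines():
        match = RESOURCE_LINE.match(line)
        if match:
            resources.append(match.groupdict())
    return resources


def stopped_resources(status_text):
    """Return the names of resources whose reported state is 'Stopped'."""
    return [r["name"] for r in parse_resources(status_text) if r["state"] == "Stopped"]


# Resource lines copied from the incident report above.
SAMPLE = """\
askap-fs1-mds01-fence (stonith:fence_ipmilan): Started askap-fs1-mds02.pawsey.org.au
askap-fs1-mds02-fence (stonith:fence_ipmilan): Stopped
askapfs1-MGS (ocf::lustre:Lustre): Stopped
askapfs1-MDT0000 (ocf::lustre:Lustre): Stopped
askapfs1-MDT0001 (ocf::lustre:Lustre): Stopped
"""

if __name__ == "__main__":
    for name in stopped_resources(SAMPLE):
        print(name)
```

Applied to the excerpt, this would flag the mds02 fence device, the MGS, and both MDTs as stopped, which is consistent with the filesystem being unresponsive at the time.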
May 26, 2020

No incidents reported.

May 25, 2020

No incidents reported.

May 24, 2020

No incidents reported.

May 23, 2020

No incidents reported.

May 22, 2020

No incidents reported.

May 21, 2020

No incidents reported.