Askapbuffer - Degraded - Storage Array 05 Controller A is non-functional

Resolved

Askapbuffer
* There have been no signs of any volumes being marked inactive or uncontactable since the filesystem check was performed on all volumes served by Storage Array 05
* A debrief was given in the RTG group
* The vendor has confirmed from the support bundle that the replacement controller is working normally
Posted May 01, 2026 - 11:09 AWST

Monitoring

OST Filesystem Scan
* The filesystem check was completed on the remaining Storage Array 05 LUNs/OSTs
* Only OST0024 presented issues, which had to be corrected

Storage Controller replacement part
* The part arrived during the remediation
* Storage Array 05 Controller A has been replaced
* The storage unit looks correct; a vendor support bundle was collected for submission to confirm system health

Askap Ingest Cluster
* As before, the nodes with the problematic (i.e. inactive) volumes were locked and would not reconnect
* The cluster was rebooted to get a clean slate and re-attach the missing OSTs

Setonix
* The Setonix data mover nodes reconnected to /askapbuffer once it was fully online
* Casda nodes reconnected to /askapbuffer once it was fully online
Posted Apr 29, 2026 - 17:12 AWST

Identified

Storage Volume becoming inactive / locked
* It has been identified that the other volumes attached to askapbuffer oss03 from Array 05 are developing issues similar to those of the volumes already checked
* "Askapbuffer" will be going down at 3pm AWST
* A filesystem check will then be performed on the other Array 05 OST volumes that sit on askapbuffer oss03
* i.e. OST00[21,23,24,25,27]
* Systems with these volumes mounted will freeze until the filesystem is released
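
The bracket shorthand used for the affected volumes in these updates (comma-separated here, pipe-separated in other updates) can be expanded into full OST names with a short script. This is a minimal sketch only: the function name `expand_targets` is hypothetical, the suffixes are treated as plain text rather than numbers, and it is not part of any Lustre tooling.

```python
import re
from typing import List


def expand_targets(pattern: str) -> List[str]:
    """Expand 'OST00[21,23]' or 'OST00[22|26]' into full target names.

    Hypothetical helper: it only handles a single bracket group and
    treats the items inside the brackets as literal text.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket group, so the name is already fully spelled out.
        return [pattern]
    prefix, body, suffix = match.groups()
    # Accept both separators seen in the incident updates: ',' and '|'.
    items = [item.strip() for item in re.split(r"[,|]", body) if item.strip()]
    return [prefix + item + suffix for item in items]


for name in expand_targets("OST00[21,23,24,25,27]"):
    print(name)
```
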
Posted Apr 29, 2026 - 14:04 AWST

Update

Filesystem "/askapbuffer" (5:15pm)
* After running e2fsck on OST00[20|22|26], the volumes are now mountable and writable
* Some nodes in the askapingest cluster were stuck and would not reconnect to the volumes; this has been addressed. Example error:
* "lfs check: error: check 'askapfs1-OST0022-osc-ffff9c33b8dd8800': Cannot send after transport endpoint shutdown (108)"
* The askapingest cluster nodes were rebooted to get a clean state and enable remounting of the filesystem
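
For anyone scanning client logs for the same symptom, the error line quoted above can be broken into its parts with a regular expression. The sketch below is illustrative only: the function name `parse_lfs_check_error` and the assumed device-name layout (`<fsname>-<target>-osc-<id>`) are inferred from the single line in this update, not from a documented format. On Linux, error number 108 corresponds to ESHUTDOWN.

```python
import re

# Pattern inferred from the one error line quoted in this incident update.
LINE_RE = re.compile(
    r"lfs check: error: check '(?P<device>[^']+)': "
    r"(?P<message>.+) \((?P<errno>\d+)\)"
)
# Assumed device layout: <fsname>-<OST target>-osc-<client id>.
DEVICE_RE = re.compile(r"(?P<fsname>[^-]+)-(?P<target>OST[0-9a-fA-F]{4})-osc-")


def parse_lfs_check_error(line):
    """Return a dict describing the failed target, or None if no match."""
    line_match = LINE_RE.search(line)
    if line_match is None:
        return None
    device_match = DEVICE_RE.match(line_match.group("device"))
    if device_match is None:
        return None
    return {
        "fsname": device_match.group("fsname"),
        "target": device_match.group("target"),
        "message": line_match.group("message"),
        "errno": int(line_match.group("errno")),
    }


sample = (
    "lfs check: error: check 'askapfs1-OST0022-osc-ffff9c33b8dd8800': "
    "Cannot send after transport endpoint shutdown (108)"
)
print(parse_lfs_check_error(sample))
```
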
Posted Apr 28, 2026 - 17:16 AWST

Update

Storage volumes OST00[22|26] have become either read-only or uncontactable
* A volume filesystem check is required for OST00[22|26]
* The storage volume pair will be taken offline to check these volumes
* Systems with these volumes mounted will freeze during this check until the filesystem is restored

After the primary checks of OST00[22|26]
* e2fsck has been run on both volumes

OST0020 has similar issues
* OST0020 will be addressed at 4:05pm
Posted Apr 28, 2026 - 15:09 AWST

Update

Pre-emptive replacement of Controller A in Storage Array 05 is pending
* The vendor has indicated that the replacement part is on backorder and is delayed
Posted Apr 28, 2026 - 10:12 AWST

Update

We are continuing to monitor for any further issues.
Posted Apr 23, 2026 - 11:11 AWST

Update

Support logs have been reviewed by the vendor
* The recommendation is to pre-emptively replace "Storage Controller A"
* The part will be shipped, and "Storage Controller A" will then be replaced in Storage Array 05 of the "/askapbuffer" system
Posted Apr 23, 2026 - 10:45 AWST

Update

System has been restored
* We are just waiting for a vendor review of the support bundle logs before we close this incident
Posted Apr 21, 2026 - 10:17 AWST

Monitoring

Storage Controller A has been restored for Storage Array 05
* We are monitoring Storage Controller A
* The storage LUNs have had their high-availability access restored
Posted Apr 20, 2026 - 10:10 AWST

Identified

We have identified an issue with the "Askap Buffer" Lustre filesystem:
* Filesystem is functional / usable but in a degraded state
* Storage Array 05 no longer has high availability as "Storage Controller A" is non-functional
* There will be an attempt to remediate "Storage Controller A"
Posted Apr 20, 2026 - 09:46 AWST
This incident affected: Lustre filesystems (/askapbuffer filesystem).