Resolved -
Askapbuffer
* No volumes have been marked inactive or uncontactable since the filesystem check was performed on all volumes served by Storage Array 05
* A debrief was given in the RTG group
* The vendor has confirmed from the support bundle that the replacement controller is working normally
May 1, 11:09 AWST
Monitoring -
OST Filesystem Scan
* The filesystem check was completed on the remaining Storage Array 05 LUNs/OSTs
* Only OST0024 presented issues that had to be corrected
Storage Controller replacement part
* The part arrived during the remediation
* Storage Array 05 Controller A has been replaced
* The storage unit looks correct; a vendor storage bundle was collected for submission to confirm system health
Askap Ingest Cluster
* As before, the problematic (i.e. inactive) volumes were locked and would not reconnect
* The cluster was rebooted to get a clean slate and re-attach the missing OSTs
Setonix
* The Setonix data mover nodes reconnected to /askapbuffer once it was fully online
* Casda nodes reconnected to /askapbuffer once it was fully online
Apr 29, 17:12 AWST
Identified -
Storage Volume becoming inactive / locked
* It has been identified that the other Array 05 volumes attached to askapbuffer oss03 are developing issues similar to those of the volumes already checked
* "Askapbuffer" will be going down at 3pm AWST
* A filesystem check will then be performed on the other Array 05 OST volumes that sit on askapbuffer oss03
* i.e. OST00[21,23,24,25,27]
* Systems with these volumes mounted will freeze until the filesystem is released
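For readers unfamiliar with the bracket shorthand used in these updates, the sketch below expands a name such as OST00[21,23,24,25,27] or OST00[22|26] into the individual OST names. It is illustrative only: the function name `expand_ost_names` is hypothetical, and the assumption that `,` and `|` both separate items is inferred from how the shorthand is written in this log.

```python
import re


def expand_ost_names(pattern: str) -> list[str]:
    """Expand shorthand like 'OST00[21,23]' or 'OST00[22|26]' into full names.

    Assumption: the text before '[' is a common prefix and each item inside
    the brackets is a suffix, separated by ',' or '|'.
    """
    match = re.fullmatch(r"(\w+)\[([^\]]+)\]", pattern)
    if match is None:
        # No bracket group: treat the input as a single, already-complete name.
        return [pattern]
    prefix, body = match.groups()
    return [prefix + item.strip() for item in re.split(r"[,|]", body)]


print(expand_ost_names("OST00[21,23,24,25,27]"))
# ['OST0021', 'OST0023', 'OST0024', 'OST0025', 'OST0027']
```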
Apr 29, 14:04 AWST
Update -
Filesystem "/askapbuffer" (5:15pm)
* After e2fsck was run on OST00[20|22|26], the volumes are now mountable and writable
* Some nodes in the askapingest cluster were stuck and would not reconnect to the volumes; this has been addressed. Example error:
* "lfs check: error: check 'askapfs1-OST0022-osc-ffff9c33b8dd8800': Cannot send after transport endpoint shutdown (108)"
* Askapingest cluster nodes were rebooted to get a clean state to enable remounting the filesystem
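As a rough illustration of how the error above can be read, the sketch below pulls the filesystem name, OST and error number out of that line. On Linux, errno 108 is ESHUTDOWN, which matches the message text. The regular expression is modelled only on the single line quoted in this update, so other `lfs check` output may not match it; the names `LFS_CHECK_ERROR` and `parse_lfs_check_error` are hypothetical.

```python
import re

# Assumption: the pattern mirrors only the one error line quoted in this
# update (fsname-OSTxxxx-osc-<instance>); it is not a general lfs parser.
LFS_CHECK_ERROR = re.compile(
    r"lfs check: error: check "
    r"'(?P<fsname>[^-']+)-(?P<ost>OST[0-9a-fA-F]{4})-osc-(?P<instance>[0-9a-f]+)': "
    r"(?P<message>.+) \((?P<errno>\d+)\)"
)


def parse_lfs_check_error(line: str):
    """Return the fields of a matching error line as a dict, or None."""
    match = LFS_CHECK_ERROR.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["errno"] = int(info["errno"])  # keep the error number as an integer
    return info


line = (
    "lfs check: error: check 'askapfs1-OST0022-osc-ffff9c33b8dd8800': "
    "Cannot send after transport endpoint shutdown (108)"
)
info = parse_lfs_check_error(line)
print(info["fsname"], info["ost"], info["errno"])
# askapfs1 OST0022 108
```

Grouping such lines by OST on each client would show which volumes a stuck node has failed to reconnect to.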
Apr 28, 17:16 AWST
Update -
Storage Volumes OST00[22|26] have become either read-only or uncontactable
* A filesystem check is required for OST00[22|26]
* The storage volume pair will be taken offline to check these volumes
* Systems with these volumes mounted will freeze during this check until the filesystem is restored
After primary checks of OST00[22|26]
* e2fsck has been run on these volumes
OST0020 has similar issues
* OST0020 will be addressed at 4:05pm
Apr 28, 15:09 AWST
Update -
Pre-emptive replacement of Controller A for Storage Array 05 is pending
* The vendor has indicated that the replacement part is on backorder and is delayed
Apr 28, 10:12 AWST
Update -
We are continuing to monitor for any further issues.
Apr 23, 11:11 AWST
Update -
Support logs have been reviewed by the vendor
* The vendor recommends pre-emptively replacing "Storage Controller A"
* The part will be shipped, and "Storage Controller A" will then be replaced in Storage Array 05 of the "/askapbuffer" system
Apr 23, 10:45 AWST
Update -
System has been restored
* We are just waiting for a vendor review of the support bundle logs before we close this incident
Apr 21, 10:17 AWST
Monitoring -
Storage Controller A has been restored for array05
* We are monitoring Storage Controller A
* Storage LUNs have had their high-availability configuration access restored
Apr 20, 10:10 AWST
Identified -
We have identified an issue with the "Askap Buffer" Lustre filesystem:
* The filesystem is functional and usable but in a degraded state
* Storage Array 05 no longer has high availability as "Storage Controller A" is non-functional
* There will be an attempt to remediate "Storage Controller A"
Apr 20, 09:46 AWST