Acacia Ingest performance heavily degraded and unreliable
Resolved
Tests have verified that the system is working reliably.

Internal monitoring shows that the cluster is healthy. Scrubbing is 98% caught up. The final 2% of scrubbing and possible subsequent repair may have some impact on cluster performance.

CASDA and MWA have begun transferring workloads back to Acacia Ingest, and we will continue to consult with them regarding ramping up the load on the system.
Posted Feb 16, 2024 - 16:59 AWST
Monitoring
Exhaustive examination of the data storage pieces revealed a set that shared a common drive. The daemon serving data from that drive had failed silently, most likely as a result of the cluster degradation. The failure was not automatically resolved when the cluster was stabilised and had to be found manually.

Once we restarted this daemon, several measures improved markedly:
* Those storage pieces responded normally.
* Garbage collection began to clear.
* Bucket listing worked properly.
* Read and write tests have all succeeded reliably.
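
For illustration, the kind of cross-check that leads to such a daemon is sketched below. It assumes standard Ceph command-line tooling on an admin node and is not a description of Pawsey's actual procedure; the placement group IDs are hypothetical, and JSON field names and service unit names can differ between Ceph releases and deployment types.

```python
# Hedged sketch (not Pawsey's actual procedure): given a set of badly behaving
# placement groups, find an OSD daemon common to all of them, i.e. the daemon
# serving one shared drive. Assumes the `ceph` CLI is available; the PG IDs
# below are hypothetical.
import json
import subprocess

def acting_osds(pgid):
    """Return the set of OSD IDs currently acting for a placement group."""
    out = subprocess.run(
        ["ceph", "pg", "map", pgid, "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(json.loads(out).get("acting", []))

# Hypothetical list of placement groups that were responding badly.
problem_pgs = ["11.3a", "11.7f", "11.c2"]

# An OSD present in every acting set is a likely common cause.
common = set.intersection(*(acting_osds(pg) for pg in problem_pgs))
print("OSDs common to all problem PGs:", sorted(common))

# Restarting the suspect daemon (package-based deployment; cephadm uses
# different unit names):
#   systemctl restart ceph-osd@<osd-id>
```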

Scrubbing is 97% caught up, and further damaged objects have been repaired. This will continue and is not expected to cause any further concerns with operations. The garbage collection backlog is now cleared.
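
As an illustration of how this kind of progress can be tracked, the sketch below queries placement group state and the RGW garbage collection backlog using standard Ceph tooling. It is a minimal example assuming an admin node with the `ceph` and `radosgw-admin` CLIs available, not a description of our internal monitoring.

```python
# Minimal progress check, assuming standard Ceph admin CLIs are available.
import json
import subprocess

def run_json(cmd):
    """Run a command and parse its JSON output."""
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return json.loads(out)

# Placement group states: how many PGs are fully clean vs still scrubbing/recovering.
status = run_json(["ceph", "status", "--format", "json"])
pgmap = status["pgmap"]
states = {s["state_name"]: s["count"] for s in pgmap["pgs_by_state"]}
clean = states.get("active+clean", 0)
print(f"PGs active+clean: {clean} of {pgmap['num_pgs']}")
print("Other PG states:", {k: v for k, v in states.items() if k != "active+clean"})

# RGW garbage collection backlog: each entry groups objects awaiting deletion.
gc = run_json(["radosgw-admin", "gc", "list", "--include-all"])
print(f"GC backlog entries: {len(gc)}")
```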

We will continue to test the system and consult with MWA and CASDA about bringing workloads back online.
Posted Feb 15, 2024 - 14:38 AWST
Update
Backfilling reduced the number of degraded objects to zero on 7th Feb 2024. With the automatic balancer reactivated and other direct intervention applied, the cluster could then be scrubbed, rebalanced and tuned.

Scrubbing is typically a slow background operation, but because very little scrubbing could be carried out for several weeks, there is a backlog to process. We accelerated the process throughout the weekend, and it is now 90% caught up. All damaged objects discovered by scrubbing have now been repaired. Another tuning operation was completed on the morning of 12th Feb 2024.
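
For context, accelerating scrubbing generally means temporarily relaxing the throttles that normally keep it in the background. The sketch below shows the kind of OSD settings involved on a recent Ceph release; the values are illustrative assumptions, not the settings applied to Acacia Ingest.

```python
# Illustrative only: raising scrub throughput via Ceph's centralised config.
# The option names are standard OSD settings; the values are assumptions and
# would normally be reverted once the backlog is cleared.
import subprocess

def ceph_config_set(section, option, value):
    subprocess.run(["ceph", "config", "set", section, option, str(value)], check=True)

ceph_config_set("osd", "osd_max_scrubs", 3)              # more concurrent scrubs per OSD
ceph_config_set("osd", "osd_scrub_sleep", 0)             # no pause between scrub chunks
ceph_config_set("osd", "osd_scrub_load_threshold", 5.0)  # allow scrubs under higher host load
```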

Each of the completed tasks above has provided a cumulative improvement to the Rados gateway (RGW) services. Their performance is now at the expected level, although it is not yet considered reliable.

There are still items to address, including:

* The remaining 10% of scrubs need to be finalised.

* Garbage collection and listing buckets are ongoing problems and may be related to each other.

* RGW logging and tracing show some errors and some orphaned objects that warrant further investigation.

The improved performance we are seeing following the completed tasks above now allows us to offer READ-only access to Acacia Ingest (no data should be deleted). We will discuss the details of this with MWA and CASDA in the coming days.
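
For reference, read-only use of the service looks like the sketch below. The endpoint, credentials, bucket and object names are placeholders rather than real Acacia Ingest values; only list and download calls are made, and nothing is written or deleted.

```python
# Read-only S3 access sketch using boto3. All names below are placeholders.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://acacia-ingest.example.org",  # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    config=Config(retries={"max_attempts": 5, "mode": "standard"}),
)

# List a few objects, then download one of them.
resp = s3.list_objects_v2(Bucket="my-bucket", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

s3.download_file("my-bucket", "path/to/object.dat", "object.dat")
```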

Data ingestion is still generally unavailable, but Pawsey plan to commence user testing of ingest this week.
Posted Feb 12, 2024 - 18:37 AWST
Update
The Acacia Ingest cluster is broadly made up of two layers. The top layer is effectively the user interface: it contains the load balancers and the Rados gateways that provide S3 access. The bottom layer is the core Ceph storage, which manages and monitors the cluster, decides which pieces of data go to which storage nodes, and looks after the health and stability of those nodes and their data.

It is important to note that there has been NO DATA LOSS, thanks to the resilient design of the system. S3 objects are composed of pieces, and each piece of data on the Ingest cluster is stored as 11 chunks, each in a separate place. Up to three chunks can be lost before a piece goes into read-only mode to protect itself. At no point in the entire life of the cluster has any piece been observed with more than two chunks missing.
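
As a rough illustration of the arithmetic, the sketch below assumes an 8+3 erasure-coded layout (8 data chunks plus 3 coding chunks). That split is consistent with the 11-chunk description above but is an assumption rather than a statement of the exact profile in use.

```python
# Back-of-envelope redundancy figures, assuming an 8+3 erasure-coded layout.
# The 8/3 split is an assumption consistent with "11 chunks, tolerate 3 losses";
# the exact profile is not stated in this post.
k, m = 8, 3                        # data chunks, coding chunks
total_chunks = k + m               # 11 chunks, each stored in a separate place
tolerated_losses = m               # up to 3 chunks can be missing and data remains readable
space_overhead = total_chunks / k  # raw capacity used per unit of user data

print(f"chunks per piece:  {total_chunks}")
print(f"losses tolerated:  {tolerated_losses}")
print(f"storage overhead:  {space_overhead:.3f}x (compare 3x for 3-way replication)")
```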

The storage layer has been degraded for some time. This was exacerbated by the EPO event in December, when a number of servers failed due to the high temperature conditions. We have additionally suffered a greater than usual number of server failures since the EPO, which we suspect is also a consequence of those December temperatures. The situation peaked following the high-voltage maintenance outage on 23 January 2024, at which point service performance became effectively unusable.

Recovery operations continue. As the storage layer stabilises, we will be able to determine whether the user interface layer has problems of its own, or has been returning 504 gateway timeout errors solely because of the degraded storage layer.
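
From a client's perspective, the symptom is straightforward to observe: a request either returns a normal S3 response or a 504 from the gateway layer. The probe below illustrates the distinction; the endpoint URL is a placeholder, not the real service address.

```python
# Simple client-side probe distinguishing a gateway timeout from other responses.
# The endpoint is a placeholder, not the real service URL.
import requests

ENDPOINT = "https://acacia-ingest.example.org"

try:
    r = requests.get(ENDPOINT, timeout=30)
    if r.status_code == 504:
        print("504 Gateway Timeout: the gateway gave up waiting on the storage backend")
    else:
        print(f"Gateway responded with HTTP {r.status_code}")
except requests.exceptions.Timeout:
    print("No response within 30 s (client-side timeout)")
```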
Posted Feb 06, 2024 - 17:10 AWST
Update
As communicated directly to key stakeholders, the service remains offline while rebalancing and other fault finding continue. In the meantime, alternative arrangements have been made for object storage, which will assist with some aspects of operations.
Posted Feb 05, 2024 - 20:03 AWST
Investigating
Unfortunately the tested change did not have the desired outcome; however, it did help us discover additional issues that warrant investigation. With the current rebalancing requirements and performance degraded to the point of timeouts, a decision has been taken to take the service offline for the weekend.

Further updates will be provided on Monday.
Posted Feb 02, 2024 - 14:44 AWST
Identified
Pawsey staff believe they have identified the cause of the performance degradation and are currently testing a configuration change.
Posted Jan 31, 2024 - 12:54 AWST
Investigating
Performance on the Rados gateways has become unreliable. We are investigating the issue.
Posted Jan 30, 2024 - 09:43 AWST
This incident affected: Storage Systems (Acacia - Ingest).