Garrawarla draining due to full OSTs on /astro

Incident Report for Pawsey Supercomputing Research Centre

Resolved

This incident has been resolved.

Posted Jun 19, 2022 - 19:13 AWST

Monitoring

As the 4 OSTs have fallen below 95% full, nodes have resumed and there are presently 41 jobs running. Migration of files on /astro is still in progress.

astrofs-OST0004_UUID 57.7T 51.2T 3.5T 94% /astro[OST:4]
astrofs-OST0009_UUID 57.7T 51.1T 3.6T 94% /astro[OST:9]
astrofs-OST0027_UUID 57.6T 51.9T 2.8T 95% /astro[OST:39]
astrofs-OST002a_UUID 57.6T 51.1T 3.6T 94% /astro[OST:42]

Posted Jun 12, 2022 - 21:39 AWST

Identified

Automated health checks have set Garrawarla compute nodes to 'drain' as 4 of the 48 OSTs on /astro are over 95% full.

$ lfs df -h | grep 9[5-9]%
astrofs-OST0004_UUID 57.7T 52.5T 2.2T 96% /astro[OST:4]
astrofs-OST0009_UUID 57.7T 52.8T 1.9T 97% /astro[OST:9]
astrofs-OST0027_UUID 57.6T 52.8T 1.9T 97% /astro[OST:39]
astrofs-OST002a_UUID 57.6T 53.0T 1.7T 97% /astro[OST:42]

These 4 have been manually set so that no new files will be created on them, and restriping is in progress across /astro to better balance the space. Staff will continue to monitor free space and update this ticket.
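
For readers who want to reproduce the free-space check, the following is a minimal sketch, not the health check Pawsey actually runs. It assumes the `lfs df -h` column layout shown above (usage percentage in the fifth field), and the function name `full_osts` and its threshold argument are hypothetical. Unlike the `9[5-9]%` pattern, a numeric comparison would also catch an OST reported at 100%.

```shell
#!/bin/sh
# Hypothetical helper: print OSTs whose usage is at or above a threshold.
# Reads `lfs df -h` style lines on stdin; the first argument is the
# threshold in percent (assumed default: 95).
full_osts() {
    threshold="${1:-95}"
    awk -v t="$threshold" '
        # Only consider OST lines; MDT and summary lines are skipped.
        /\[OST:[0-9]+\]/ {
            pct = $5               # fifth field is Use%, e.g. "96%"
            sub(/%/, "", pct)      # strip the percent sign
            if (pct + 0 >= t) print $1, $5
        }'
}
```

Assuming a Lustre client with /astro mounted, it could be used as `lfs df -h /astro | full_osts 95`.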

Posted Jun 11, 2022 - 18:19 AWST