PCIe errors from galaxy-int
Identified
The internal login node on Galaxy (galaxy-int), which is used for interactive Slurm jobs, has recently been logging PCIe bus errors. HPE/Cray staff are aware and will investigate further. Compute blades can normally be removed for maintenance in a warm-swap operation, but the internal login node (c0-0c0s0n2) sits on the same blade as the system's internal boot node (c0-0c0s0n1), so the repair may be more disruptive.
Posted Nov 07, 2022 - 06:28 AWST
This incident affects: Legacy Systems (Galaxy Compute nodes).