PCIe errors from galaxy-int
Resolved
Galaxy has been decommissioned.
Posted Aug 01, 2023 - 10:20 AWST
Identified
The internal login node on Galaxy (galaxy-int), which is used for interactive Slurm jobs, has recently been logging PCIe bus errors. HPE/Cray staff are aware and will investigate further. Although compute blades can be removed for maintenance in a warm-swap operation, the internal login node (c0-0c0s0n2) sits on the same blade as the system's internal boot node (c0-0c0s0n1), so repair may be a more disruptive process.
Posted Nov 07, 2022 - 06:28 AWST