URZ HPC-Cluster Sofja


[Photos: rack 6 front (cold aisle); outside, Sep 2021; night view; hot aisle]

Sofja - 288-node Infiniband cluster (ca. 800 TFLOPs)

News:

  • Nov2021 - our switch-on plan below (dates not yet fixed)
    • 24.11.2021 - renaming HPC21 to Sofja
    • 24.11.2021 - 67 active users of t100-hpc copied, incl. ssh authorized keys
    • 25.11.2021 - new HPC21/Sofja system available for users
    • 29.11.2021 - switching off the old HPC cluster (in service Dec2015-Nov2021)
  • See History/Timeline for photos of the progress
  • Jul2022 to Dec2022++ - massive hardware failures (see history and power-off-problem page)

Short description

HPC means "High Performance Computing" (German: "Hochleistungsrechnen"). The cluster "Sofja" is an HPC cluster for university scientific use. It is mainly intended for parallelized applications with high network communication and high memory demand, i.e. workloads which do not fit on a single workstation. It is based on Linux, the job scheduler slurm and the MPI library for the high-speed network. The HPC cluster Sofja replaces the older HPC system Neumann.
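To give an impression of the intended kind of application, the following is a minimal MPI example in C (a generic sketch, not specific to Sofja; the file name is only illustrative):

  /* hello_mpi.c - minimal MPI example (generic sketch, not Sofja-specific) */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, namelen;
      char name[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);                  /* start the MPI runtime */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank (ID) of this process */
      MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
      MPI_Get_processor_name(name, &namelen);  /* node this process runs on */
      printf("rank %d of %d on %s\n", rank, size, name);
      MPI_Finalize();
      return 0;
  }

Such a program is usually compiled with an MPI compiler wrapper (e.g. mpicc hello_mpi.c -o hello_mpi) and started as a batch job via the slurm commands sbatch/srun; the exact modules and options available on Sofja may differ.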

Hardware

Architecture: 292 Infiniband-connected ccNUMA nodes
Processor (CPU): 2 x 16c/32t Ice Lake Xeon 6326, base=2.9GHz, max. turbo=3.5GHz (non-AVX?), 512-bit vector support (AVX512, 2 FMA units, 2.4GHz), 32 FLOP/clock per core, 3 TFLOP/node (flt64), xhpl=2.0 TFLOP/node (test run, 2.4GHz used), 8 memory channels/CPU at 3.2 GT/s each, 205 GB/s/CPU, 185 W/CPU
Board: D50TNP1SB (4 boards per 2U chassis)
Main memory (RAM): 256 GBytes, 16 x 16GB DDR4-3200 ECC, memory bandwidth 410 GB/s/node (4 fat nodes with 1024GB/node, partition "fat")
Storage (disks): diskless compute nodes, 5 BeeGFS storage nodes with dm-encrypted 3*(8+2 RAID6) * 4TB each, ca. 430TB in total, ca. 2.4GB/s per storage node, extended in 2022 to 10 nodes with 870TB
ior results: home=2.5GB/s (1 OSS), scratch=10.6GB/s (9 OSS, 1 client node), scratch=20.4GB/s (9 OSS, 2 client nodes)
Network: Gigabit Ethernet (management), HDR/2 Infiniband (100Gb/s), non-blocking
Power consumption: ca. 180kW max. (idle: ca. 54kW; test: 620W/node; plus max. 27% for cooling)
Performance data: MemStream 409 GB/s/node (theoretical), Triad ca. 71% of that
MPI: 12.3 GB/s/wire (alltoall, uniform traffic, best case)
Peak = 2460 GFLOPs/node (6.9 FLOP/Byte, 4.0 GF/W); rough consistency checks are given below the GPU lines
GPU nodes: 15 nodes added around Apr 2022, one GPU card per node, partitions "gpu", "gpu46GB", "gpu80GB" (see below)
GPU A30: GA100 chip, 24GiB RAM, 10 TFLOPs (flt32), flt64=1/2 of that, 7 nodes, partition "gpu"
GPU A40: GA102 chip, 45GiB RAM, 37 TFLOPs (flt32), flt64=1/32, 3 nodes, partition "gpu46GB" (45GiB-12MiB)
GPU A100: GA100 chip, 80GiB RAM, 20 TFLOPs (flt32), flt64=1/2, 5 nodes, partition "gpu80GB"
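For orientation, the quoted CPU-node figures fit together; a few back-of-the-envelope checks using only the numbers above (rounded):
  Peak/node:   2 CPUs * 16 cores * 32 FLOP/clock * 2.4 GHz (AVX512) = 2457.6 GFLOP/s ≈ 2460 GFLOPs/node
  Memory:      2 CPUs * 8 channels * 3.2 GT/s * 8 Byte ≈ 410 GB/s/node (205 GB/s/CPU)
  Efficiency:  2460 GFLOP/s / 620 W/node ≈ 4.0 GF/W; 288 nodes * 620 W ≈ 179 kW, matching the ca. 180 kW maximum
  MPI wire:    HDR/2 = 100 Gb/s ≈ 12.5 GB/s nominal, so the measured 12.3 GB/s/wire is about 98% of line rate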

Software

User access:

Access via ssh to sofja.urz.uni-magdeburg.de (141.44.5.38) is only allowed from within the university IP range. Please use your university account name as login. It is recommended to use ssh public keys for passwordless logins. Please send your ssh public key together with a short description of your project, the project duration and the maximum amount of storage (in GB) you will probably need during your project. Students need an informal confirmation from their university tutor that they are allowed to use central HPC resources for science. This machine is not suited for work with personal data. If you use Windows and Exceed for (graphical) access, take a look at the Windows/ssh configuration hints.

Please note that the HPC storage is not intended for long-term data archiving. There is only some hardware redundancy (RAID6) and no backup. We explicitly do not make backups in order not to reduce the performance of the applications. You are therefore responsible for saving your data outside the HPC system. Please remove unneeded data to leave more space for others. Thanks!

For questions and problems please contact the administration via mailto:Joerg.Schulenburg+hpc21(at)URZ.Uni-Magdeburg.DE?subject=hpc-sofja or Tel. 58408 (German or English).

GPU user access:

Since Nov 2024 the GPU nodes are managed differently. As a compromise for the typical use case (deep learning), the security concept differs from the rest of the HPC cluster.

Privacy/Security

History/Timeline:

Projects:

This is an incomplete list of projects on this cluster, to give you an impression of what the cluster is used for.

Questions and Answers:

Problems:

Aug22: after ~8 months of usage, many nodes (~100) with unintended power-off and failed power-on until power was removed; 3 nodes boot with CPU clock below 230MHz; unstable MPI processes for big jobs of 88-140 nodes (partly fixed in Sep); logs about fans (some nodes have thousands of entries), PSUs, ECC-UE (uncorrectable errors with no or only seldom CE errors) and ECC-CE (correctable errors); false BB voltage readings of 14.06 V on the 12V line; false PSU temperature readings of 63 C or 127 C (clearly outliers); 2 totally failing nodes that showed the above signs beforehand (one showed aging). The kind of error messages seems random (at some fail state?), seems to become much worse over time (off-problem) and seems related to dynamic clocking (off-problem); changing fmax on all nodes mostly produces power-off failures on some nodes. (Power-off problem fixed 2023Q1 by FW update, but still thousands of FAKE sensor errors.)

Sep22: after CPU fmax changes, some power supplies (PS) or their sensors were found to go into a fail state; usually they come back after an "mc power reset" of the 1st node. 2 power supplies died; one failed PS3 caused all 4 connected nodes to run below 230MHz until PS3 was removed and the nodes were replugged. Still too many jobs failed because of failed nodes; some nodes (3) were found speed-limited to 800MHz or above but slower than the rest; some jobs hang in MPI communication after about 10-12h of normal processing until timeout. (Speed fixed in Q1 2023 by FW update; PSU/PMBus issue partly improved but not fixed.)

Oct22: found that the BMC of the GPU nodes is not reachable if the node is switched off, i.e. they can be switched off but not switched on remotely (fixed Q1 2023).

Oct22, 20th: 18 nodes (6.2%) run below 230MHz (below the minimum CPU clock of 800MHz, shown by /proc/cpuinfo). This looks Quad-related because they are grouped into 6 of 72 Quads (8.3%), with 3 + 4 + 1 + 3 + 3 + 4 nodes. The 4 nodes of each Quad share the same 3 power supplies (2+1 redundancy, 3*2kW). Speed is 20 to 45 times slower (e.g. 45 min boot time, md5 speed is 45x slower, pointing to an effective CPU clock of 53 MHz). 4 further nodes (not counted above) could be brought back to normal speed by disconnecting the Quad from power; this did not work for the above nodes. (Temporarily fixed Q4 2022 by an ipmi raw command; the overcurrent cause was fixed Q1 2023 by FW update.)

Design-problems:
- 3 PSUs per chassis fit badly with 2 PDUs per rack, especially when there is no strong power capping in case of a failed PDU

Further HPC-Systems:


More info on the central HPC compute servers can be found on the CMS websites (content management system) or on the fall-back OvGU-HPC overview.


Author: Joerg Schulenburg, Uni-Magdeburg URZ, Tel. 58408 (2021-2026)