Architecture of High Performance Computing Server at BIT Mesra

A High-Performance Computing (HPC) server was installed a few years back as a replacement for PARAM 10000, the supercomputer that is no longer available for use. Initially, the HPC was under the Department of Computer Science. However, since its primary users were in the Department of Chemical Engineering and Biotechnology (mostly for simulation work), the administration decided to move it under the Central Instrumentation Facility (CIF). You need permission from the CIF to access the HPC.

The HPC is available only for research purposes; to gain access, you need to provide a good reason along with a recommendation from a professor.

The HPC is at least 20 times more powerful than the most powerful PC anyone has on campus. Yet, when I recently checked the usage, I realized that not even 10% of its capacity is being utilized. I hope this blog post will help you understand the core architecture of the HPC.

Architecture

[Figure: The Architecture of High Performance Computing Server]

The Computers

The HPC is a clustered computer network. There is a master node connected to the college LAN, and it is accessible from any computer on the LAN or Wi-Fi. The master node is only accessible via SSH (Secure Shell); all other remote access services (such as Telnet) are closed. The master node is reachable at 172.16.23.1.
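
For example, from any machine on the college LAN or Wi-Fi you can open a session like this (the username is a placeholder; use whatever account the CIF assigns to you):

# Connect to the master node over SSH from the campus network
ssh your_username@172.16.23.1

# Or, if the campus DNS resolves the hostname:
ssh your_username@csehpc.bitmesra.ac.in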

There are 17 compute nodes connected to the master node in a star network topology. The first 16 nodes are CPU-based compute nodes, and the 17th node is a GPU-based compute node. All compute nodes have the same configuration, which is why the HPC is more like a cluster than a distributed system. The compute nodes are where all the code execution takes place, and the master node is the one that distributes the tasks among the compute nodes.
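
As a minimal illustration of this fan-out, you can run a command on every compute node from the master node. This sketch assumes passwordless SSH from the master to the compute nodes (a common cluster setup, though I have not confirmed it here) and uses the node hostnames from /etc/hosts shown later in this post:

# From the master node, run a command (here, hostname) on all 17 compute nodes
for i in $(seq 1 17); do
    ssh csehpc-n$i hostname
done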

Apart from these, there is a single storage node. The advantage of having a single shared storage node is that a file stored on the master node is also available on all the compute nodes.

The Master Node

CPU - 2 x (Intel Xeon E5-2630 v3)
Cores - 8 per CPU
Hyperthreading - Disabled (I don't know if it can be enabled through software)
Virtualization - Available
Total CPUs - 16
Clock speed - 2.4 GHz
Memory - 64 GB
Internal HDD - Total size - 2 TB
               Partition 1 - /dev/sde3 ~1 TB mounted at /
               Partition 2 - /dev/sdf1 ~500 GB mounted at /apps
               Partition 3 - /dev/sdf2 ~450 GB mounted at /scratch
               Partition 4 - /dev/sde1 ~500 MB mounted at /boot
Operating System - CentOS 6.5 release
Domain Name - csehpc.bitmesra.ac.in
IP - External - 172.16.23.1 | Internal Primary - 192.168.10.1 | Internal Secondary - 10.10.1.1 

The operating system is an old release of CentOS, a Linux distribution closely related to Red Hat Enterprise Linux (RHEL); if you are familiar with RHEL, CentOS should feel very familiar. The master node is quite powerful in its own right.

It is the only node that has access to the internet (and it does not require a Cyberoam login). The downlink speed is approximately 10 MB/s. The master node is also the only node reachable from the external network (the college LAN); all compute nodes are reachable only via the master node.

Here is the output of lscpu:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
CPU MHz: 2401.000
BogoMIPS: 4788.55
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15

The Compute Nodes

CPU - 2 x (Intel processor; exact model not reported by lscpu. The data on the college's website is incorrect: the compute nodes report a lower clock speed than the master node.)
Cores - 8 per CPU
Hyperthreading - Disabled (I don't know if it can be enabled through software)
Virtualization - Available
Total CPUs - 16
Clock speed - 1.2 GHz
GPU - Nvidia Tesla K20m. (Only available on GPU node.)
Memory - 64 GB
Internal HDD - Total size - 500 GB
               Partition 1 - /dev/sda3 ~415 GB mounted at /
               Partition 2 - /dev/sda1 ~500 MB mounted at /boot
Operating System - CentOS 6.5 release
Domain Name - csehpc-n[x].bitmesra.ac.in where [x] is the node number from 1-17 (17 nodes in total).
IP - External - Unreachable | Internal Primary - 192.168.10.[x+1] | Internal Secondary - 10.10.1.[x+1]

All compute nodes have the same architecture and can be used for parallel processing. The compute nodes are unreachable from the external network. You must use the master node to access all compute nodes. Furthermore, none of the compute nodes have access to the internet; they cannot even resolve a domain name.
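
You can see both points from the master node (csehpc-n1 here is just the first compute node from the host table further below):

# Hop from the master node onto a compute node
ssh csehpc-n1

# On the compute node, internal names resolve via /etc/hosts ...
ping -c 2 csehpc          # the master node, 192.168.10.1

# ... but external names do not resolve, and there is no route to the internet
ping -c 2 google.com      # fails with "unknown host"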

Here is the output of lscpu:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Stepping: 2
CPU MHz: 1200.000
BogoMIPS: 4788.55
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15

The Storage Node

The storage node has a total capacity of 48 TB. Out of this, 21 TB is unavailable for use as it is used for cloud storage (I think it is reserved for professors). The remaining storage is mounted on /home, which is available across all the nodes (master + compute). This means that any file that is in the /home folder of any one node will be accessible at the same location on other nodes.

How is this helpful? Since the compute nodes do not have internet access, you can download a file on the master node, and that file will be available across all compute nodes.
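
For example (the URL and file names below are placeholders; the point is only that /home is shared):

# On the master node, which has internet access, download into /home
wget -P ~/datasets http://example.com/input-data.tar.gz

# On any compute node, the same file is already visible because /home is shared
ssh csehpc-n1 'ls -lh ~/datasets/input-data.tar.gz'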

The Network

All the nodes are connected via an InfiniBand switch. InfiniBand switches are specially designed for HPC servers to provide low latency and high throughput.
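
If you want to inspect the InfiniBand link yourself, the standard InfiniBand diagnostic tools report the port state and rate. I am assuming here that the infiniband-diags and libibverbs utilities are installed on the nodes, which I have not verified:

# Show the state, rate, and GUIDs of the local InfiniBand port(s)
ibstat

# List the InfiniBand devices visible to this node
ibv_devices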

There are two internal networks in the HPC:

1. Primary network:
  • Network address: 192.168.10.0
  • Nodes: 192.168.10.1 - 192.168.10.18
  • Domain: csehpc.bitmesra.ac.in
2. Secondary (Backup) network:
  • Network address: 10.10.1.0
  • Nodes: 10.10.1.1 - 10.10.1.18
  • Domain: icsehpc

Here is the output of cat /etc/hosts:

127.0.0.1 localhost localhost.localdomain
172.16.23.1 csehpc.local

Primary Network

192.168.10.1 csehpc.bitmesra.ac.in csehpc
192.168.10.2 csehpc-n1.bitmesra.ac.in csehpc-n1
192.168.10.3 csehpc-n2.bitmesra.ac.in csehpc-n2
192.168.10.4 csehpc-n3.bitmesra.ac.in csehpc-n3
192.168.10.5 csehpc-n4.bitmesra.ac.in csehpc-n4
192.168.10.6 csehpc-n5.bitmesra.ac.in csehpc-n5
192.168.10.7 csehpc-n6.bitmesra.ac.in csehpc-n6
192.168.10.8 csehpc-n7.bitmesra.ac.in csehpc-n7
192.168.10.9 csehpc-n8.bitmesra.ac.in csehpc-n8
192.168.10.10 csehpc-n9.bitmesra.ac.in csehpc-n9
192.168.10.11 csehpc-n10.bitmesra.ac.in csehpc-n10
192.168.10.12 csehpc-n11.bitmesra.ac.in csehpc-n11
192.168.10.13 csehpc-n12.bitmesra.ac.in csehpc-n12
192.168.10.14 csehpc-n13.bitmesra.ac.in csehpc-n13
192.168.10.15 csehpc-n14.bitmesra.ac.in csehpc-n14
192.168.10.16 csehpc-n15.bitmesra.ac.in csehpc-n15
192.168.10.17 csehpc-n16.bitmesra.ac.in csehpc-n16
192.168.10.18 csehpc-n17.bitmesra.ac.in csehpc-n17

Secondary Network

10.10.1.1 icsehpc
10.10.1.2 icsehpc-n1
10.10.1.3 icsehpc-n2
10.10.1.4 icsehpc-n3
10.10.1.5 icsehpc-n4
10.10.1.6 icsehpc-n5
10.10.1.7 icsehpc-n6
10.10.1.8 icsehpc-n7
10.10.1.9 icsehpc-n8
10.10.1.10 icsehpc-n9
10.10.1.11 icsehpc-n10
10.10.1.12 icsehpc-n11
10.10.1.13 icsehpc-n12
10.10.1.14 icsehpc-n13
10.10.1.15 icsehpc-n14
10.10.1.16 icsehpc-n15
10.10.1.17 icsehpc-n16
10.10.1.18 icsehpc-n17
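
Because both networks are listed in /etc/hosts, any node can be reached by either name. For example, from the master node (using the first compute node as the target, and assuming the backup interfaces are up):

# Reach node 1 over the primary network ...
ping -c 2 csehpc-n1       # resolves to 192.168.10.2

# ... or over the secondary (backup) network
ping -c 2 icsehpc-n1      # resolves to 10.10.1.2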

Conclusion

That's all for the architecture of the HPC. I hope you found it as fascinating as I do. The HPC is very powerful if you know how to utilize it effectively. In the next article, I will help you set up a Machine Learning environment on a single compute node.
