Architecture of High Performance Computing Server at BIT Mesra
A High-Performance Computing (HPC) server was installed a few years back. It was a replacement for PARAM 10000, the supercomputer that is no longer available for use. Initially, the HPC was under the Department of Computer Science. The Department of Chemical Engineering and Biotechnology was the primary user of the HPC (mostly for simulation purposes), and so the administration decided to move it under the Central Instrumentation Facility (CIF). You need permission from the CIF to access the HPC.
HPC is only available for research purposes, and you need to provide a good reason along with a proper recommendation from a professor to gain access to the HPC.
The HPC is at least 20 times more powerful than the most powerful PC that anyone has on campus. Also, I recently checked the usage and realized that not even 10% of its power is being utilized. I hope this blog post will help you in understanding the core architecture of the HPC.
Architecture
Figure: The architecture of the High Performance Computing server
The Computers
HPC is a clustered computer network. There is a master node that is connected to the college LAN and is reachable from any computer on the LAN or Wi-Fi. The master node is only accessible via SSH (Secure Shell); other remote-access services (such as Telnet) are closed. The master node is accessible at 172.16.23.1.
There are 17 compute nodes connected to the master node in a star network topology. The first 16 nodes are CPU-based compute nodes, and the 17th node is a GPU-based compute node. All compute nodes have the same configuration, which is why the HPC is more like a cluster than a distributed system. The compute nodes are where all the code execution takes place, and the master node is the one that distributes the tasks among the compute nodes.
Apart from these, there is a single storage node. Because its storage is shared by every node, a file saved there from the master node is also available on all the compute nodes.
The Master Node
CPU - 2 x (Intel Xeon E5-2630 v3)
Cores - 8 per CPU
Hyperthreading - Disabled (I don't know if it can be enabled through software)
Virtualization - Available
Total cores - 16
Clock speed - 2.4 GHz
Memory - 64 GB
Internal HDD - Total size - 2 TB
Partition 1 - /dev/sde3 ~ 1 TB mounted at /.
Partition 2 - /dev/sdf1 ~500 GB mounted at /apps.
Partition 3 - /dev/sdf2 ~450 GB mounted at /scratch.
Partition 4 - /dev/sde1 ~500 MB mounted at /boot.
Operating System - CentOS 6.5 release
Domain Name - csehpc.bitmesra.ac.in
IP - External - 172.16.23.1 | Internal Primary - 192.168.10.1 | Internal Secondary - 10.10.1.1
The operating system is an old version of CentOS, a Linux distribution closely related to Red Hat Enterprise Linux (RHEL). If you are familiar with RHEL, then CentOS should be easy to understand. The master node is quite powerful in itself.
It is the only node that has access to the internet (without Cyberoam login). The downlink speed is approximately 10 MB/s. The master node is the only node that is reachable from the external network (college LAN). All other compute nodes are reachable only via the master node.
Here is the output of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
CPU MHz: 2401.000
BogoMIPS: 4788.55
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
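The core count does not have to be read off by hand; it can be derived from the lscpu fields. Here is a minimal Python sketch, assuming the lscpu text has already been captured into a string. The helper names parse_lscpu, physical_cores and logical_cpus are my own and are not part of any tool on the HPC.

```python
def parse_lscpu(text):
    """Turn 'Key: value' lines from lscpu into a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def physical_cores(fields):
    """Physical cores = sockets x cores per socket."""
    return int(fields["Socket(s)"]) * int(fields["Core(s) per socket"])


def logical_cpus(fields):
    """Logical CPUs = physical cores x threads per core."""
    return physical_cores(fields) * int(fields["Thread(s) per core"])


# A few of the fields from the master node output above.
SAMPLE = """\
CPU(s): 16
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
"""

info = parse_lscpu(SAMPLE)
print(physical_cores(info), logical_cpus(info))
```

Since there is one thread per core, the logical CPU count equals the physical core count, which matches hyperthreading being disabled.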
The Compute Nodes
CPU - 2 x (Intel processor). The data on the college's website is incorrect; compute nodes have a slower processor than the master node.
Cores - 8 per CPU
Hyperthreading - Disabled (I don't know if it can be enabled through software)
Virtualization - Available
Total cores - 16
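Clock speed - 1.2 GHz as reported by lscpu (this may be a reduced idle frequency rather than the rated speed, since the BogoMIPS value is identical to the master node's)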
GPU - Nvidia Tesla K20m. (Only available on GPU node.)
Memory - 64 GB
Internal HDD - Total size - 500 GB
Partition 1 - /dev/sda3 ~415 GB mounted at /.
Partition 2 - /dev/sda1 ~500 MB mounted at /boot.
Operating System - CentOS 6.5 release
Domain Name - csehpc-n[x].bitmesra.ac.in where [x] is the node number from 1-17 (Total 17 nodes).
IP - External - Unreachable | Internal Primary - 192.168.10.[x+1] | Internal Secondary - 10.10.1.[x+1]
All compute nodes have the same architecture and can be used for parallel processing. The compute nodes are unreachable from the external network. You must use the master node to access all compute nodes. Furthermore, none of the compute nodes have access to the internet; they cannot even resolve a domain name.
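Because every compute node sits behind the master node, reaching one takes two hops. The sketch below builds an ssh command that makes both hops in one go using OpenSSH's -J (ProxyJump) option. It rests on a few assumptions: your own machine has OpenSSH 7.3 or newer, the master node permits connection forwarding, and the same username works on both machines. The username "student" and the helper names are hypothetical; the hostnames follow the csehpc-n1 to csehpc-n17 names in the /etc/hosts listing.

```python
MASTER_EXTERNAL_IP = "172.16.23.1"  # master node address on the college LAN
COMPUTE_NODES = 17


def node_hostname(node):
    """Short hostname of compute node `node` (1 to 17)."""
    if not 1 <= node <= COMPUTE_NODES:
        raise ValueError("compute nodes are numbered 1 to 17")
    return f"csehpc-n{node}"


def ssh_via_master(user, node):
    """Return an ssh argument list that hops through the master node.

    The target hostname is looked up by the master node, so it does not
    need to be resolvable from your own machine.
    """
    jump = f"{user}@{MASTER_EXTERNAL_IP}"
    target = f"{user}@{node_hostname(node)}"
    return ["ssh", "-J", jump, target]


# "student" is a placeholder username.
print(" ".join(ssh_via_master("student", 3)))
```

If the -J option is not available, logging in to the master node first and then running ssh to the compute node from there achieves the same result.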
Here is the output of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Stepping: 2
CPU MHz: 1200.000
BogoMIPS: 4788.55
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
The Storage Node
The storage node has a total capacity of 48 TB. Out of this, 21 TB is unavailable for use as it is used for cloud storage (I think it is reserved for professors). The remaining storage (about 27 TB) is mounted on /home, which is available across all the nodes (master + compute). This means that any file placed in the /home folder from any one node will be accessible at the same location on the other nodes.
How is this helpful? Since the compute nodes do not have internet access, you can download a file on the master node, and that file will be available across all compute nodes.
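Only files under /home are shared this way; paths such as /scratch or /apps belong to the local disk of each node. As a small illustration, the following Python sketch checks whether a path lies under the shared mount before you rely on it being visible from a compute node. The function name is_shared and the example paths are hypothetical.

```python
from pathlib import PurePosixPath

SHARED_ROOT = PurePosixPath("/home")  # mounted from the storage node on every node


def is_shared(path):
    """Return True if an absolute path lies under the shared /home mount.

    This is a purely textual check: it does not resolve symlinks or
    '..' components, so pass it an already-normalized path.
    """
    p = PurePosixPath(path)
    return p.is_absolute() and (p == SHARED_ROOT or SHARED_ROOT in p.parents)


print(is_shared("/home/student/dataset.zip"))
print(is_shared("/scratch/tmp.bin"))
```

A file downloaded on the master node into a path for which is_shared returns True should then be readable at the same path from any compute node.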
The Network
All the nodes are connected via an InfiniBand switch. InfiniBand switches are specially designed for HPC servers to provide low latency and high throughput.
There are two internal networks in the HPC:
1. Primary network:
- Gateway: 192.168.10.0
- Nodes: 192.168.10.1 - 192.168.10.18
- Domain: csehpc.bitmesra.ac.in
2. Secondary (Backup) network:
- Gateway: 10.10.1.0
- Nodes: 10.10.1.1 - 10.10.1.18
- Domain: icsehpc
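The two address ranges line up one-to-one: the master node takes the first address in each range and the 17 compute nodes take the next 17. Here is a short Python sketch of that mapping using the standard ipaddress module. The function name node_addresses and the position numbering (0 for the master node) are my own conventions for the example.

```python
import ipaddress

PRIMARY_MASTER = ipaddress.ip_address("192.168.10.1")
SECONDARY_MASTER = ipaddress.ip_address("10.10.1.1")
COMPUTE_NODES = 17


def node_addresses(index):
    """Return (primary, secondary) addresses for a machine.

    index 0 is the master node; 1 to 17 are the compute nodes.
    """
    if not 0 <= index <= COMPUTE_NODES:
        raise ValueError("index must be between 0 and 17")
    return str(PRIMARY_MASTER + index), str(SECONDARY_MASTER + index)


# Master node, first compute node, last compute node.
for i in (0, 1, 17):
    print(i, *node_addresses(i))
```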
Here is the output of cat /etc/hosts:
127.0.0.1 localhost localhost.localdomain
172.16.23.1 csehpc.local
# Primary Network
192.168.10.1 csehpc.bitmesra.ac.in csehpc
192.168.10.2 csehpc-n1.bitmesra.ac.in csehpc-n1
192.168.10.3 csehpc-n2.bitmesra.ac.in csehpc-n2
192.168.10.4 csehpc-n3.bitmesra.ac.in csehpc-n3
192.168.10.5 csehpc-n4.bitmesra.ac.in csehpc-n4
192.168.10.6 csehpc-n5.bitmesra.ac.in csehpc-n5
192.168.10.7 csehpc-n6.bitmesra.ac.in csehpc-n6
192.168.10.8 csehpc-n7.bitmesra.ac.in csehpc-n7
192.168.10.9 csehpc-n8.bitmesra.ac.in csehpc-n8
192.168.10.10 csehpc-n9.bitmesra.ac.in csehpc-n9
192.168.10.11 csehpc-n10.bitmesra.ac.in csehpc-n10
192.168.10.12 csehpc-n11.bitmesra.ac.in csehpc-n11
192.168.10.13 csehpc-n12.bitmesra.ac.in csehpc-n12
192.168.10.14 csehpc-n13.bitmesra.ac.in csehpc-n13
192.168.10.15 csehpc-n14.bitmesra.ac.in csehpc-n14
192.168.10.16 csehpc-n15.bitmesra.ac.in csehpc-n15
192.168.10.17 csehpc-n16.bitmesra.ac.in csehpc-n16
192.168.10.18 csehpc-n17.bitmesra.ac.in csehpc-n17
# Secondary Network
10.10.1.1 icsehpc
10.10.1.2 icsehpc-n1
10.10.1.3 icsehpc-n2
10.10.1.4 icsehpc-n3
10.10.1.5 icsehpc-n4
10.10.1.6 icsehpc-n5
10.10.1.7 icsehpc-n6
10.10.1.8 icsehpc-n7
10.10.1.9 icsehpc-n8
10.10.1.10 icsehpc-n9
10.10.1.11 icsehpc-n10
10.10.1.12 icsehpc-n11
10.10.1.13 icsehpc-n12
10.10.1.14 icsehpc-n13
10.10.1.15 icsehpc-n14
10.10.1.16 icsehpc-n15
10.10.1.17 icsehpc-n16
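A hosts file like this is easy to turn into a lookup table, which is handy when scripting against the nodes. The following is a minimal Python sketch, assuming the file content has been read into a string; parse_hosts is a hypothetical helper, and the sample contains only a few of the entries listed above.

```python
def parse_hosts(text):
    """Map every hostname and alias in hosts-file text to its IP address."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        parts = line.split()
        if len(parts) < 2:  # blank line, or an address with no names
            continue
        ip, names = parts[0], parts[1:]
        for name in names:
            table.setdefault(name, ip)  # keep the first address seen for a name
    return table


SAMPLE = """\
127.0.0.1 localhost localhost.localdomain
# Primary Network
192.168.10.1 csehpc.bitmesra.ac.in csehpc
192.168.10.2 csehpc-n1.bitmesra.ac.in csehpc-n1
# Secondary Network
10.10.1.1 icsehpc
10.10.1.2 icsehpc-n1
"""

hosts = parse_hosts(SAMPLE)
print(hosts["csehpc-n1"], hosts["icsehpc-n1"])
```

On the master node itself, the same function could be fed the text of /etc/hosts to list every node name on both networks.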
Conclusion
That's all for the architecture of the HPC. I hope that the architecture fascinated you. The HPC is very powerful if you know how to utilize it effectively. In the next article, I will help you set up a Machine Learning environment on a single compute node.