Setting up Machine Learning environment on High Performance Computing Server

In the last article, I discussed the architecture of the HPC. If you have not read it yet, I recommend reading it before continuing with this one.

Architecture of High Performance Computing Server at BIT Mesra 

The power of the HPC can be utilized for its most important application in the field of computer science - Machine Learning. I am assuming that you already have your SSH credentials to log on to the master node. We will be setting up the environment in Python. Let's jump straight to the steps.

Step 1: Download and Install Anaconda on the master node

Note that you are not the root user of the HPC. You are just a regular user, so administrative commands (sudo or su) won't work. Anaconda makes it much easier for non-root users to install Python packages, and we will use it to set up Python 3 and install the required packages.

  • Log in to the master node.
    > ssh be1005815@172.16.23.1
  • Go to https://www.anaconda.com/distribution/ and copy the link for the 64-bit Linux Python 3 distribution. Download the package on the master node using wget. The --no-check-certificate flag is needed because the OS is too old to verify newer TLS certificates.
    > wget https://repo.anaconda.com/archive/Anaconda3-2018.12-Linux-x86_64.sh --no-check-certificate
  • Change the permission of the downloaded file.
    > chmod u+x Anaconda3-2018.12-Linux-x86_64.sh
  • Install Anaconda.
    > ./Anaconda3-2018.12-Linux-x86_64.sh
  • Follow the steps to install Anaconda on the master node.
  • Anaconda will be installed at /home/<username>/anaconda3/. Log out and log in again to the master node so that the updated PATH takes effect.
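
After logging in again, it can be worth confirming that the shell now picks up Anaconda's tools and not the system ones. The following is only a minimal sketch: the helper names is_under and check_tool are made up for illustration, and it assumes the default install location ~/anaconda3 mentioned above. Run it with the Anaconda interpreter (for example ~/anaconda3/bin/python), since the old system Python may not provide shutil.which.

```python
import os
import shutil


def is_under(path, root):
    """Return True if `path` lies inside directory `root` (symlinks resolved)."""
    path = os.path.realpath(path)
    root = os.path.realpath(root)
    return os.path.commonpath([path, root]) == root


def check_tool(name, root):
    """Look up `name` on PATH and report whether it lives under `root`."""
    found = shutil.which(name)
    return found, (found is not None and is_under(found, root))


if __name__ == "__main__":
    # Assumption: Anaconda was installed to the default location.
    anaconda_root = os.path.expanduser("~/anaconda3")
    for tool in ("conda", "python"):
        found, ok = check_tool(tool, anaconda_root)
        status = "inside anaconda3" if ok else "NOT inside anaconda3"
        print("{}: {} ({})".format(tool, found, status))
```

If either tool is reported as not inside anaconda3, the installer probably did not update your shell startup file, or you have not logged in again yet.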

Step 2: Create an environment and install packages

In this step, we will create a Python environment on the master node, then log in to one of the compute nodes and activate it there.
  • Create a new Python environment using Conda. I want to set up a TensorFlow environment, so I will name it 'tensorflow'.
    > conda create -n tensorflow python=3.6
  • Activate the environment.
    > conda activate tensorflow
  • Install the required packages. Here I will be installing TensorFlow (CPU) and Keras. TensorFlow GPU is currently not supported on CentOS 6.5 because of its old version of GLibC. You can install other packages like Pandas or PyTorch the same way.
    > conda install tensorflow keras
  • Test the installation by importing TensorFlow from Python and printing its version.
    > python -c "import tensorflow as tf; print(tf.__version__)"
  • Deactivate the created environment on the master node. Remember, the master node is only for distribution. Do not use the master node for computation. 
    > conda deactivate
  • Log in to one of the compute nodes from the master node (remember, everything installed on the master node is now available on the compute nodes as well). You can use commands like free (to check memory usage) or top (to list all running processes) to check which node can be used for your application. Here I am using Compute Node 1.
    > ssh be1005815@10.10.1.2
  • Activate the environment on the compute node.
    > conda activate tensorflow
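
The node check done with free and top above can also be approximated from Python once you are on a compute node. This is a rough sketch and not a replacement for those tools: parse_meminfo and node_summary are hypothetical helper names, and it assumes a Linux node that exposes /proc/meminfo.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Memory fields are reported by the kernel in kB.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def node_summary(meminfo_path="/proc/meminfo"):
    """Collect core count, 1-minute load average and a rough memory picture."""
    with open(meminfo_path) as fh:
        mem = parse_meminfo(fh.read())
    load_1min = os.getloadavg()[0]
    # MemFree + Buffers + Cached is only a rough estimate of usable memory.
    usable_kb = mem.get("MemFree", 0) + mem.get("Buffers", 0) + mem.get("Cached", 0)
    return {
        "cores": os.cpu_count(),
        "load_1min": load_1min,
        "mem_total_mib": mem.get("MemTotal", 0) // 1024,
        "mem_usable_mib": usable_kb // 1024,
    }


if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):
        print(node_summary())
```

A node whose 1-minute load is well below its core count and that has plenty of usable memory is a reasonable candidate for your job.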

Step 3: Start Jupyter Notebook and setup SSH Tunnel

Jupyter notebook is installed by default along with Anaconda. You can start Jupyter notebook on one of the compute nodes and set up an SSH tunnel on the master node to access it.
  • On the same compute node, create a Jupyter notebook configuration file.
    > jupyter notebook --generate-config
  • Open the configuration file in your favourite command line editor and set the following two options (they are commented out by default) to allow remote access from any IP.
    > nano .jupyter/jupyter_notebook_config.py 

           c.NotebookApp.allow_origin = '*' #allow all origins
           c.NotebookApp.ip = '0.0.0.0' # listen on all IPs
  • Start the Jupyter notebook. Once it has started, copy the token from the terminal output.
    > jupyter notebook
  • Don't close the existing SSH session. Open a new terminal and log on to the master node again.
    > ssh be1005815@172.16.23.1
  • Create an SSH tunnel on the master node. Here I am mapping port 8888 of Compute Node 1 to port 8000 of the master node (if this port is not available on the master node, then try some other port).
    > ssh -g -L 8000:localhost:8888 -f -N be1005815@10.10.1.2
  • Go to the browser and open the address http://172.16.23.1:8000/. Enter the copied token and you will be able to access the Notebook.
  • Done
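
Since port 8000 on the master node may already be taken by someone else's tunnel, a quick check before running ssh -L can save some trial and error. The sketch below is only an illustration: port_is_free and first_free_port are made-up helper names, the 8000 to 8099 range is an arbitrary assumption, and the script is meant to be run on the master node.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use" or a permission error.
            return False
    return True


def first_free_port(start=8000, end=8100):
    """Return the first free port in [start, end), or None if all are taken."""
    for port in range(start, end):
        if port_is_free(port):
            return port
    return None


if __name__ == "__main__":
    # Assumption: any port in this range is acceptable for the tunnel.
    print("Suggested local port for the tunnel:", first_free_port())
```

Use the printed port in place of 8000 in the ssh -L command and in the browser address. The check is only a snapshot, so another user could still grab the port before your tunnel starts.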

A word of advice

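  • Don't eat up the ports on the master node. Once done, you can kill the SSH tunnel by running the following command on the master node. As a regular user, this terminates every ssh process that you own on that node, not just one tunnel.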
    > killall ssh
  • You can use tools like PM2 or nohup to keep a Jupyter notebook session running on a compute node after you log out. Don't forget to stop them once done.
  • TensorFlow GPU requires GLibC >= 2.14, while the one installed on the GPU compute node is GLibC 2.12. That's why TensorFlow cannot use the GPU as of now. In case someone finds an alternative way of using the GPU with TensorFlow, do tell me and I would be more than willing to link to your article.
  • Respect the computing power. It is shared by everyone in the college. Don't waste the resource by running unnecessary computation.
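
If you want to see for yourself which GLibC a node provides, Python's standard library can report it. This is a small sketch under the numbers quoted above: the 2.14 threshold is taken from this article, and parse_version and glibc_is_sufficient are hypothetical helper names.

```python
import platform

# Threshold quoted in this article for TensorFlow GPU.
REQUIRED_GLIBC = (2, 14)


def parse_version(text):
    """Turn a version string such as '2.12' into a tuple such as (2, 12)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_is_sufficient(version_text, required=REQUIRED_GLIBC):
    """Return True if the given GLibC version meets the required minimum."""
    return parse_version(version_text) >= required


if __name__ == "__main__":
    # libc_ver() inspects the running Python binary; it may return empty
    # strings if the C library cannot be identified.
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        verdict = "sufficient" if glibc_is_sufficient(version) else "too old"
        print("glibc {} is {} for the 2.14 requirement".format(version, verdict))
    else:
        print("Could not determine the GLibC version on this node")
```

Running it on the GPU compute node should show whether the situation has changed after any future OS upgrade.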

Conclusion

So, that is all you need to set up an ML environment on the HPC. Note that you are still using a single compute node. 16 cores should be enough for datasets smaller than 1 GB. I will be posting another article in which we will set up Big Data processing on the HPC.

Comments

1. Great article. Just a few doubts:
     1. How do you get IPs of all compute nodes?
     2. When you kill all ssh on master how do you make sure that others' ssh tunnels are not killed?

   Also slight correction in this line 'If this port is not available on master node, then try some other node' .. Should be port instead of node.

   Replies
     1. Got answer to my first question in your previous post.
     2. Thanks for the correction. killall kills all the ssh tunnels. In case you need to kill a particular ssh tunnel, find its process id using top and then use kill to kill that particular ssh tunnel.

2. It may not be possible for non root users to directly run 'conda activate'. If this is the case, users can use 'source activate envname'
