Following a survey made for a well estabilished group at LinkedIn, Big Data professionals impute the causes of failure of most Big Data projects to a lack of knowledge on how to set up, deploy and manage a Hadoop cluster. Although I consider this vision a bit too much simplistic, while I endorse a vision where the change management matters, the bootstrap difficulties for the IT departments are indeed huge and steep. Major Hadoop distribution companies, like Cloudera and IBM, are aware of this flaw and provide sophisticated installers and managers, while they offer quick start VMs for developers at same time. First of all, this cannot be enough for those Enterprises which need to put their production environments under control and it’s even not enough for those developers which need to use something more reliable than a single-node virtual cluster. Finally, it’s important to understand how Hadoop works under the hood, but of course without wasting time on configuration details.
Vagrant definitively comes to help. Using a Cloudera Vagrant box, you’ll be able to build a complete scalable cluster that (with some minor tuning) could even be used in production environments. In this tutorial we’re going to build a cluster based on Cloudera Hadoop Distributions in minutes and without user interventions.
First of all, I must say that Cloudera Manager is really a nice tool, and I use it on my daily based tasks. It starts its duties installing a cluster and then keep it manageable and healthy. It centralizes config files, logs, outputs and process management, and that’s great. But installation of the cluster is still a process that needs user intervention: the administrator must add the nodes to the cluster by hand, then enlists services, then provides database credentials and so on. This lend to several ugly states:
- User interventions means a huge waste of time for installing the cluster; this is especially true for developers, assuming QuickVMs are not enough for them;
- User interventions means error prone installations;
- User interactions means it cannot be possible to automate installations using script; this is expensive for IT departments since they cannot use Puppet or similar to automate server management.
Vagrant is the tool I decided to use. What exactly Vagrant is and how to install it is beyond the scope of this tutorial, but there’s plenty of tutorials on the official site on it. To start, we’re going to get the needed Vagrant files and configuring them.
Getting needed Vagrant files and configuring the cluster
Assuming you have Vagrant already installed, you need to clone or download this GitHub repository. Unzip it in a directory on a local disk. You may see that there’s just some bash script on the repo, but several virtual hard disk files will be created on this directory upon provisioning, so be sure you’ve enough free space on the disk.
Install VirtualBox if you’ve not done already. Although Vagrant supports a range of virtualization providers, I use a lot of specific features of VirtualBox on this provisioner. This makes VirtualBox a mandatory requirement at this stage.
The cluster will be based on CentOS 6.5 x64; in other words, each VM built by the script will run this operating system.
The provisioner provides some shell scripts used for automatically setup the bounce of virtual machines in the virtual network. Although you won’t edit these files in normal conditions, sometimes you have to; so let’s have a look to them in details:
- provision_for_mount_disk.sh: this script is executed on each node of the cluster. By default, Vagrant boxes come with a 8 GB virtual hard disk, which is by far too small for any serious Big Data storage. Said that, the provisioner (exactly here) automatically creates a bigger secondary virtual hard disk for each node on the working directory, and this shell script mounts these disks to /opt.
- provision_for_cdh_node.sh: this script is executed on each node of the cluster. Its purpose is to install Oracle Java SDK 1.7 and makes it default, since Cloudera doesn’t support OpenJDK, which is provided by default on CentOS.
- provision_for_cdh_datanode.sh: this script is executed on each node of the cluster except the master node. Its purpose is to install and start all the services needed by datanodes and HBase region nodes. Datanodes share a virtual private network with the Masternode, but they don’t have a public IP. This means you won’t be able sto access them directly. However, Vagrant SSH port forwarding is enabled to remote login to instances; forwarded ports start at 2200 and following, but you must check the log printed during the provisioning phase to be sure of which port is linked to a particular node.
- provision_for_cdh_masternode.sh: this script is executed on the master node only. You would like to edit it to tune configurations and root password for MySql or some other minor changes, but you can use the default configuration as well. The masternode shares the virtual private network with datanodes, but has a secondary network interface with a public IP as well. By default, the IP is took bridging to the LAN used on the host machine and through DHCP, but you may change this behaviour editing this line of Vagrantfile.
Said that, let’s have a look to parameters you actually need to set up. They are on Vagrantfile:
# Configuration parameters
managerRam = 2048 # Ram in MB for the Cludera Manager Node
nodeRam = 2048 # Ram in MB for each DataNode
nodeCount = 3 # Number of DataNodes to create
privateNetworkIp = "10.10.50.5" # Starting IP range for the network between nodes
secondaryStorage = 80 # Size in GB for the secondary virtual HDD
Their functions should be clear, but let’s go deeper:
- managerRam: it’s the amount of RAM for the master node; 2048 is the absolute minimum, but 4096 is recommended (especially if you plan to use Hue)
- nodeRAM: it’s the amount of RAM for each datanote; 2048 should be enough in normal conditions
- nodeCount: the number of datanodes to create. Each datanode spawns its own VM, so at the end of the day your cluster will have nodeCount + 1 (the master node) VMs. Keep in mind this number while planning storage and RAM capacities.
- privateNetworkIp: this is the private IP for the fist machine of the cluster (the masternode), others will follow. Tune this parameter if you need to link your cluster to an existing private network or to other Hadoop clusters, but keep in mind that all VMs must share the same virtual private network name which by default is cdhnetwork (virtual private network name is set here and here)
- secondaryStorage: the size of the secondary virtual hard disk made for data storage. As usual, these disks are one-way dynamically allocated: starting small, increasing in size upon usage.
After you set these parameters and assuming you don’t need deeper customization, it’s time to start the cluster with the up command triggered in the working directory. I suggest you to redirect the stdout and to inspect the log file during the way, just to know what’s going on.
# vagrant up > ./cdh5-provisioning.log &
# tail ./cdh-5provisioning.log
You may notice the welcome message in the log after master node provisioning. It’s a summary of installed services and their web interfaces, when available. Don’t forget to have a look to the documentation for the provisioner and have a nice Map/Reduce day!