Wednesday, February 13, 2013

Hadoop Installation on Ubuntu (A to Z)

Single Node Cluster setup

1) Making apt-get work and installing the SSH server and client for remote login to other machines
Run these commands to add a dedicated user hduser and the group hadoop to your local machine
(every machine in a multi-node cluster must have hduser; otherwise you will get an error when
making SSH connections to remote hosts).
% sudo addgroup hadoop
% sudo adduser --ingroup hadoop hduser

NOTE: Log in as hduser.
a.ii) If you are behind a proxy server, configure the /etc/apt/apt.conf file (as root):
(a.ii.1) Run % gksudo gedit /etc/apt/apt.conf or % sudo vi /etc/apt/apt.conf (create the file if it does not
already exist) and add the following lines.
Acquire::http::Proxy "http://name:password@proxyip:port";
Acquire::ftp::Proxy "ftp://name:password@proxyip:port";
Acquire::https::Proxy "https://name:password@proxyip:port";
(a.ii.2) Here name and password are the proxy username and proxy password respectively.
Check by entering
% sudo apt-get update (it should work).
a.iii) Also uncomment the sources in the /etc/apt/sources.list file by removing the # from the very first
line and from lines where deb directly follows the #.
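The three proxy lines from step a.ii can also be generated by a small shell function. This is only a sketch: the function name apt_proxy_conf is made up for this example, and the user, password, host and port values are placeholders you must replace with your own.

```shell
#!/bin/sh
# Print apt proxy settings suitable for /etc/apt/apt.conf.
# Arguments (all placeholders chosen by the caller): user, password, host, port.
apt_proxy_conf() {
    user=$1; pass=$2; host=$3; port=$4
    for scheme in http ftp https; do
        # Each directive needs straight double quotes and a trailing semicolon.
        printf 'Acquire::%s::Proxy "%s://%s:%s@%s:%s";\n' \
            "$scheme" "$scheme" "$user" "$pass" "$host" "$port"
    done
}

# Example: review the output first, then append it to the file as root:
# apt_proxy_conf name password proxyip port | sudo tee -a /etc/apt/apt.conf
```

Printing the lines before writing them makes it easy to spot curly quotes or a missing semicolon, both of which apt rejects.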

a.iv) Run this command to install ssh server and client
(a.iv.1) % sudo apt-get install openssh-server
(a.iv.2) % sudo apt-get install openssh-client
(a.iv.3) The server should now be running; check with % sudo service ssh status. You can also use one
of < start | restart | stop > in place of status in that command.
(a.iv.4) NOTE: SSH gives access to remote machines and lets us use them as if we were
sitting in front of them.
(a.iv.4.a) Now we will generate a pair of public and private keys so that we can log in to
the local machine without being asked for the password every time we make the connection, and
(a.iv.4.b) connect to the nodes (datanodes, secondary namenode, jobtracker,
tasktracker) from the namenode without being asked for a password each time we log in remotely.
For now we just want to make an SSH connection to localhost, so do that by
(a.iv.4.b.i) hduser@ubuntu:~$ ssh localhost
(a.iv.4.c) generating key pair
(a.iv.4.c.i) hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
(a.iv.4.c.ii) OUTPUT:-Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
(a.iv.4.c.iii) This command creates an RSA key pair with an empty passphrase.
Next, you have to enable SSH access to your local machine with this newly created key.

hduser@ubuntu:~$ cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys
(a.iv.4.c.iv) The final step is to test the SSH setup by connecting to your local machine
as the hduser user. This step is also needed to save your local machine's host key
fingerprint to the hduser user's known_hosts file.
hduser@ubuntu:~$ ssh localhost
OUTPUT:-The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC
2010 i686 GNU/Linux
Ubuntu 10.04 LTS
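If ssh localhost still asks for a password after the key has been added, a common cause is loose permissions on the .ssh directory or on authorized_keys, or a key that was appended more than once in a damaged form. The helper below is a minimal sketch, not part of the original setup: the name authorize_key is hypothetical, and you pass it the path of your own public key file.

```shell
#!/bin/sh
# Append a public key to authorized_keys only if it is not already present,
# and set the restrictive permissions that sshd usually expects.
# Usage: authorize_key PUBLIC_KEY_FILE [SSH_DIR]   (SSH_DIR defaults to ~/.ssh)
authorize_key() {
    pubkey_file=$1
    ssh_dir=${2:-$HOME/.ssh}
    auth_file=$ssh_dir/authorized_keys

    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    touch "$auth_file" && chmod 600 "$auth_file" || return 1

    # -x matches whole lines, -F treats the key as a fixed string,
    # -f reads the pattern (the key line) from the public key file.
    if ! grep -qxFf "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}

# Example: authorize_key /path/to/your_public_key
```

Running it twice leaves a single copy of the key in the file, so it is safe to repeat on every node.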

2) Disabling IPv6
Open file /etc/sysctl.conf and add the following lines to the end of the file:
#disables ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have to reboot your machine in order to make the changes take effect.
You can check whether IPv6 is enabled on your machine with the following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled, a value of 1 means disabled (that’s what we want).
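The same check can be wrapped in a small function that turns the 0 or 1 into a readable answer. This is a sketch under the assumption that the flag file contains a single 0 or 1; the function name ipv6_status and its optional file argument are introduced here only for illustration and testing.

```shell
#!/bin/sh
# Report the IPv6 state based on a disable_ipv6 flag file.
# The optional argument lets you point the function at another file.
ipv6_status() {
    flag_file=${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}
    case "$(cat "$flag_file" 2>/dev/null)" in
        0) echo "enabled" ;;
        1) echo "disabled" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Example: ipv6_status   (prints "disabled" once the settings above are active)
```

On many systems, sudo sysctl -p reloads /etc/sysctl.conf, which may apply the change without a reboot; treat that as something to verify on your own machine.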

3) JDK INSTALLATION and HADOOP configuration
Download jdk1.7.0_04.tar.gz (or a higher version) for your system from the Java download site by
selecting Java SE.
Download a stable version of Hadoop (the output shown later in this tutorial comes from version 1.0.3).
Unpack both files in hduser's home directory:
hduser@ubuntu:~# tar xzf jdk1.7.0_04.tar.gz
hduser@ubuntu:~# tar xzf hadoop-1.0.3.tar.gz
hduser@ubuntu:~# mv hadoop-1.0.3 hadoop
hduser@ubuntu:~# chown -R hduser:hadoop hadoop
Update $HOME/.bashrc or $HOME/.bash_profile, for example with the command
hduser@ubuntu:~# gedit $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of hduser.
# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hduser/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/home/hduser/jdk1.7.0_04
# some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
# Requires installed 'lzop' command.
lzohead () {
    hadoop fs -cat "$1" | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

You can repeat this exercise for other users who want to use Hadoop.
a.i) After installation, make a quick check whether the JDK is correctly set up:
hduser@ubuntu:~# java -version
a.ii) After installation, make a quick check whether Hadoop is correctly set up:
hduser@ubuntu:~# hadoop version
Open /home/hduser/hadoop/conf/ and set the JAVA_HOME environment variable. Change
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
to
# The java implementation to use. Required.
export JAVA_HOME=/home/hduser/jdk1.7.0_04
Now we create the directory and set the required ownerships and permissions:
% sudo mkdir -p /app/hadoop/tmp
% sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
% sudo chmod 750 /app/hadoop/tmp

If you forget to set the required ownerships and permissions, you will see an error
when you try to format the name node in the next section.
Add the following snippets between the <configuration> ... </configuration> tags in the respective
configuration XML file.
In file conf/core-site.xml:
<!-- In: conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
In file conf/mapred-site.xml:
<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>
In file conf/hdfs-site.xml:
<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.</description>
</property>

4) Formatting the HDFS filesystem via the NameNode
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is
implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.
Note: Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS).

hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop namenode -format
The output will look like this:
hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop namenode -format
05/22/12 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.3
STARTUP_MSG: build = 911707;
compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
05/22/12 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
05/22/12 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
05/22/12 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
05/22/12 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
05/22/12 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/

Starting your single-node cluster
Run the command:
a.i) hduser@ubuntu:~#
This will startup a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.
The output will look like this:
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/

To check whether your tasktracker, jobtracker, datanode and namenode have started, just type jps.
a.ii) hduser@ubuntu:~# jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode
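Reading the jps list by eye gets tedious once you restart the cluster a few times. The sketch below checks the output for the five daemons listed above; the function name missing_daemons is made up for this example, and it assumes jps prints one "pid name" pair per line as shown.

```shell
#!/bin/sh
# Read jps output on stdin and report which expected Hadoop daemons are missing.
# Returns 0 when all five daemons are present, 1 otherwise.
missing_daemons() {
    jps_output=$(cat)
    missing=""
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # Compare the second field exactly, so "NameNode" does not
        # accidentally match "SecondaryNameNode".
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            missing="$missing $daemon"
        fi
    done
    [ -z "$missing" ] && return 0
    echo "missing:$missing"
    return 1
}

# Usage on a running node: jps | missing_daemons
```

If anything is reported as missing, the log file of that daemon is the first place to look.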
a.iii) You can also check with netstat if Hadoop is listening on the configured ports.
hduser@ubuntu:~# sudo netstat -plten | grep java
tcp 0 0* LISTEN 1001 9236 2471/java
tcp 0 0* LISTEN 1001 9998 2628/java
tcp 0 0* LISTEN 1001 8496 2628/java
tcp 0 0* LISTEN 1001 9228 2857/java
tcp 0 0* LISTEN 1001 8143 2471/java
tcp 0 0* LISTEN 1001 9230 2857/java
tcp 0 0* LISTEN 1001 8141 2471/java
tcp 0 0* LISTEN 1001 9857 3005/java
tcp 0 0* LISTEN 1001 9037 2785/java
tcp 0 0* LISTEN 1001 9773 2857/java
If there are any errors, examine the log files in the /logs/ directory.
a.iv) Stopping your single-node cluster
Run the command below to stop all the daemons running on your machine.
hduser@ubuntu:/usr/local/hadoop$ bin/
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
a.v) Running a MapReduce job
We will now run your first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
Download some text files in plain text UTF-8 encoding and store them in a temporary directory of your choice, for example ~/gutenberg.
You can download text files from these links:
hduser@ubuntu:~# ls -l ~/gutenberg/
total 3604
-rw-r--r-- 1 hduser hadoop 674566 Feb 3 10:17 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1573112 Feb 3 10:18 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1423801 Feb 3 10:18 pg5000.txt
a.vi) Restart the Hadoop cluster
Restart your Hadoop cluster if it’s not running already.
a.vii) Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to
Hadoop’s HDFS.
hduser@ubuntu:~# hadoop dfs -copyFromLocal ~/gutenberg ~/gutenberg_hds
hduser@ubuntu:~# hadoop dfs -ls /home/hduser
Found 1 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /home/hduser/gutenberg_hds
hduser@ubuntu:~# hadoop dfs -ls ~/gutenberg_hds
Found 3 items
-rw-r--r-- 3 hduser supergroup 674566 2012-22-05 11:38 /home/hduser/gutenberg_hds/pg20417.txt
-rw-r--r-- 3 hduser supergroup 1573112 2012-22-05 11:38 /home/hduser/gutenberg_hds/pg4300.txt
-rw-r--r-- 3 hduser supergroup 1423801 2012-22-05 11:38 /home/hduser/gutenberg_hds/pg5000.txt

a.viii) Run the MapReduce job
Now, we actually run the WordCount example job.
hduser@ubuntu:~# hadoop jar hadoop*examples*.jar wordcount ~/gutenberg_hds ~/gutenberg-output
This command will read all the files in the HDFS directory ~/gutenberg_hds, process them, and store the
result in the HDFS directory ~/gutenberg-output.
Example output of the previous command in the console:
hduser@ubuntu:~# hadoop jar hadoop*examples*.jar wordcount ~/gutenberg_hds ~/gutenberg-output
05/22/12 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
05/22/12 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
05/22/12 17:43:02 INFO mapred.JobClient: map 0% reduce 0%
05/22/12 17:43:14 INFO mapred.JobClient: map 66% reduce 0%
05/22/12 17:43:17 INFO mapred.JobClient: map 100% reduce 0%
05/22/12 17:43:26 INFO mapred.JobClient: map 100% reduce 100%
05/22/12 17:43:28 INFO mapred.JobClient: Job complete: job_201005081732_0001
05/22/12 17:43:28 INFO mapred.JobClient: Counters: 17
05/22/12 17:43:28 INFO mapred.JobClient: Job Counters
05/22/12 17:43:28 INFO mapred.JobClient: Launched reduce tasks=1
05/22/12 17:43:28 INFO mapred.JobClient: Launched map tasks=3
05/22/12 17:43:28 INFO mapred.JobClient: Data-local map tasks=3
05/22/12 17:43:28 INFO mapred.JobClient: FileSystemCounters
05/22/12 17:43:28 INFO mapred.JobClient: FILE_BYTES_READ=2214026
05/22/12 17:43:28 INFO mapred.JobClient: HDFS_BYTES_READ=3639512
05/22/12 17:43:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3687918
05/22/12 17:43:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=880330
05/22/12 17:43:28 INFO mapred.JobClient: Map-Reduce Framework
05/22/12 17:43:28 INFO mapred.JobClient: Reduce input groups=82290
05/22/12 17:43:28 INFO mapred.JobClient: Combine output records=102286
05/22/12 17:43:28 INFO mapred.JobClient: Map input records=77934
05/22/12 17:43:28 INFO mapred.JobClient: Reduce shuffle bytes=1473796
05/22/12 17:43:28 INFO mapred.JobClient: Reduce output records=82290
05/22/12 17:43:28 INFO mapred.JobClient: Spilled Records=255874
05/22/12 17:43:28 INFO mapred.JobClient: Map output bytes=6076267
05/22/12 17:43:28 INFO mapred.JobClient: Combine input records=629187
05/22/12 17:43:28 INFO mapred.JobClient: Map output records=629187
05/22/12 17:43:28 INFO mapred.JobClient: Reduce input records=102286

Check if the result is successfully stored in the HDFS directory /home/hduser/gutenberg-output:
hduser@ubuntu:~# hadoop dfs -ls /home/hduser
Found 2 items
drwxr-xr-x - hduser supergroup 0 2012-22-05 17: /home/hduser/gutenberg_hds
drwxr-xr-x - hduser supergroup 0 2012-22-05 17:43 /home/hduser/gutenberg-output
hduser@ubuntu:~# hadoop dfs -ls /home/hduser/gutenberg-output
Found 2 items
drwxr-xr-x - hduser supergroup 0 2012-22-05 17:43 /home/hduser/gutenberg-output/logs
-rw-r--r-- 1 hduser supergroup 880802 2012-22-05 17:43 /home/hduser/gutenberg-output/part-r-

a.ix) Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command
hduser@ubuntu:~# hadoop dfs -cat /home/hduser/gutenberg-output/part-r-00000
to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will
copy the results to the local file system, though.
hduser@ubuntu:~# mkdir ~/gutenberg-output
hduser@ubuntu:~# hadoop dfs -getmerge /home/hduser/gutenberg-output ~/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ head ~/gutenberg-output/gutenberg-output
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 1
"A_ 1
"Absoluti 1
"Alack! 1
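To get a feel for what the WordCount job computes, you can produce the same kind of "word, tab, count" listing locally with standard Unix tools. This is only an approximation for cross-checking small inputs, assuming words are separated by whitespace and punctuation stays attached to the word (as in the output above); the function name local_wordcount is made up for this example.

```shell
#!/bin/sh
# Count whitespace-separated words from stdin and print "word<TAB>count",
# sorted by word in byte order.
local_wordcount() {
    tr -s '[:space:]' '\n' |        # one word per line
        grep -v '^$' |              # drop empty lines from leading whitespace
        LC_ALL=C sort |             # group identical words together
        uniq -c |                   # prefix each distinct word with its count
        awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Example: cat ~/gutenberg/*.txt | local_wordcount | head
```

For large inputs this runs on one machine only, which is exactly the limitation the MapReduce job removes.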

Running hadoop on multi-node cluster
5) Prerequisites and assumptions:
• Set up a single-node Hadoop cluster on each node.
• Shut down each single-node cluster with /bin/ before continuing if you haven't done so already.
• One node in the cluster is made the master and the remaining nodes are made slaves. Let the hostname of the master machine be master and the hostnames of the slave machines be slave1, slave2, slave3, and so on. For now, let us take only 2 slaves, i.e. slave1 and slave2; the master is also
considered both master and slave.
Note: Be careful to use the correct hostnames. Open /etc/hostname to find the machine's hostname.
'hduser@master:-$' means the command is run from the hadoop user account on the master node and
'hduser@slave:-$' means the command is run from the hadoop user account on a slave node.

6) Configuring multi-node from single-node:
1. Append the following lines to the /etc/hosts file on all the machines of the cluster
# /etc/hosts (for master and slaves) master slave1 slave2
Note: The IP addresses used above are samples; replace them with the IP addresses of the nodes in your cluster.
Also change the line, <hostname> to # <hostname>

2. From master PC, run the following
hduser@master:-$ ssh-copy-id -i $HOME/.ssh/ hduser@master
hduser@master:-$ ssh-copy-id -i $HOME/.ssh/ hduser@slave1
hduser@master:-$ ssh-copy-id -i $HOME/.ssh/ hduser@slave2
Note: These commands should be run from the master PC only. If the number of nodes in the cluster is n, n commands have to be run. If n is very large, write a shell script that loops over the nodes.
Testing: Try to connect to the slaves from the master PC; you should connect without being asked for a password.
hduser@master:-$ ssh master
hduser@master:-$ ssh slave1
hduser@master:-$ ssh slave2
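For a larger cluster, the looping script mentioned in the note could look roughly like this. It is a sketch only: the function name copy_key_to_nodes and the DRY_RUN switch are inventions of this example, and the key file path and host names are whatever you pass in.

```shell
#!/bin/sh
# Copy one public key to hduser on every node given on the command line.
# Set DRY_RUN=1 to print the commands instead of running them.
# Usage: copy_key_to_nodes KEY_FILE NODE...
copy_key_to_nodes() {
    key_file=$1; shift
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh-copy-id -i $key_file hduser@$node"
        else
            # Keep going if one node fails, but report it on stderr.
            ssh-copy-id -i "$key_file" "hduser@$node" || echo "failed: $node" >&2
        fi
    done
}

# Example (dry run first):
# DRY_RUN=1 copy_key_to_nodes /path/to/your_public_key master slave1 slave2
```

Doing a dry run first lets you confirm the host list before the script starts asking for each node's password.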

3. On the master node, update the /usr/local/hadoop/conf/slaves file so that it looks as follows:
master
slave1
slave2
Note: List all the hostnames of slave machines in this file.
4. Change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on
all machines as follows.
i. In core-site.xml
<value>hdfs://localhost:54310</value> is changed to <value>hdfs://master:54310</value>
ii. In mapred-site.xml
<value>localhost:54311</value> is changed to <value>master:54311</value>
iii. In hdfs-site.xml
<name>dfs.replication</name><value>1</value> is changed to <name>dfs.replication</name><value>3</value>
Note: The dfs.replication value in hdfs-site.xml is set to 3 because we have 3 machines in our
cluster. If we had n machines in the cluster, we would write n there instead of 3.

5. On the master node, open the /app/hadoop/tmp/dfs/name/current/VERSION file and note down the
namespaceID.
Now go to the /app/hadoop/tmp/dfs/data/current/VERSION file on each slave machine and replace the
namespaceID there with the one you noted.
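Copying the ID by hand is error-prone, so here is a sketch of two helpers that read and rewrite it. It assumes the VERSION file stores the ID on a line of the form namespaceID=<number>; the function names get_namespace_id and set_namespace_id are made up for this example. Stop the Hadoop daemons before editing the files.

```shell
#!/bin/sh
# Print the namespaceID stored in a VERSION file.
get_namespace_id() {
    sed -n 's/^namespaceID=//p' "$1"
}

# Rewrite the namespaceID line of a VERSION file, leaving other lines untouched.
# Usage: set_namespace_id VERSION_FILE NEW_ID
set_namespace_id() {
    version_file=$1; new_id=$2
    tmp_file=$(mktemp) || return 1
    # Write to a temporary file first, then copy the content back with cat
    # so the original file keeps its owner and permissions.
    sed "s/^namespaceID=.*/namespaceID=$new_id/" "$version_file" > "$tmp_file" &&
        cat "$tmp_file" > "$version_file"
    status=$?
    rm -f "$tmp_file"
    return $status
}

# Example on a slave, with the master's ID already known:
# set_namespace_id /app/hadoop/tmp/dfs/data/current/VERSION <id-from-master>
```

After the change, get_namespace_id should print the same value on the master's name directory and on every slave's data directory.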
---------------------------------Multinode hadoop cluster is set up-------------------------------------------
Testing the multi-node Hadoop cluster: start Hadoop on the master node. It should start the namenode, datanode, secondary
namenode, jobtracker and tasktracker on the master machine and only the datanode and tasktracker on the slave
nodes (excluding master).
16017 Jps
14799 NameNode
15686 TaskTracker
14880 DataNode
15596 JobTracker
14977 SecondaryNameNode
hduser@slave:# jps
15183 DataNode
15897 TaskTracker
16284 Jps
hduser@slave:# jps
Run the MapReduce job as illustrated in the single-node cluster setup. To stop Hadoop, run the stop script on the
master node. It will stop the daemons (datanode, tasktracker, namenode, ...) on all the nodes in the cluster.
You can check by running the jps command on the nodes as illustrated before.
Note: All the commands which involve
I. Copying data to HDFS
II. Running mapreduce job
III. Retrieving output from HDFS
are issued from the master node itself. We can look at the status of the slave machines also from the
master node.
