How to Set Up a Single-Node (Pseudo-Distributed) Hadoop Cluster
Step 1:
Get the Hadoop RPM from the Apache site (search Google for "apache hadoop download"):
http://www.apache.org/dyn/closer.cgi/hadoop/common/
In the LinuxWorld lab, run:
# yum install hadoop
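A quick way to confirm the package actually installed is to query rpm:
# rpm -qa | grep -i hadoop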
Step 2:
Get the JDK RPM from the Oracle site (search Google for "jdk download"):
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
In the LinuxWorld lab, run:
# yum install jdk
Step 3:
[root@server Desktop]# rpm -ql jdk | grep java$
/etc/.java
/usr/java
/usr/java/jdk1.7.0_51/bin/java
/usr/java/jdk1.7.0_51/jre/bin/java
[root@server Desktop]# /usr/java/jdk1.7.0_51/bin/java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
[root@server Desktop]# java -version
java version "1.7.0_09-icedtea"
OpenJDK Runtime Environment (rhel-2.3.4.1.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)
[root@server Desktop]# echo $JAVA_HOME
/usr
[root@server Desktop]# JAVA_HOME=/usr/java/jdk1.7.0_51/
[root@server Desktop]# echo $JAVA_HOME
/usr/java/jdk1.7.0_51/
[root@server Desktop]# java -version
java version "1.7.0_09-icedtea"
OpenJDK Runtime Environment (rhel-2.3.4.1.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)
[root@server Desktop]# PATH=$JAVA_HOME/bin:$PATH
Note: $JAVA_HOME/bin must come before $PATH in the command above, so that the Oracle JDK is found ahead of the system OpenJDK.
[root@server Desktop]# java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
Step 4:
[root@server Desktop]# vim /root/.bash_profile
export JAVA_HOME=/usr/java/jdk1.7.0_51/
PATH=$JAVA_HOME/bin:$PATH
[root@server Desktop]# . /root/.bash_profile
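With the profile in place, any new shell should pick up the Oracle JDK automatically; a quick check (same paths as configured above):
# echo $JAVA_HOME
# java -version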
Step 5:
Hadoop is mostly set up by default, but we need to point it at our Java installation in its environment file:
[root@localhost /]# vim /etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_51/
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=500
To test that it is working, run the command below:
[root@localhost /]# hadoop fs -ls /
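You can also confirm which Hadoop version is installed and that the hadoop command itself runs:
# hadoop version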
Step 6: Set up the HDFS NameNode and DataNode
[root@server hadoop]# vim /etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/data/nodename</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/data/dataname</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
<final>true</final>
</property>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<final>true</final>
</property>
</configuration>
Note: the directories above are created automatically; there is no need to create them beforehand.
[root@server hadoop]# hadoop namenode -format
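After a successful format, the NameNode metadata directory configured above is populated; a quick sanity check (directory name as set in hdfs-site.xml):
# ls /data/nodename/current
It should contain files such as VERSION, fsimage and edits.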
Step 7: Start the NameNode and DataNode
[root@server hadoop]# vim /etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:10001</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
</configuration>
[root@server hadoop]# hadoop-daemon.sh start namenode
The command above opens listening ports; check with:
# netstat -tnlp | grep java
tcp 0 0 127.0.0.1:10001 0.0.0.0:* LISTEN 14969/java
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 14969/java
[root@server hadoop]# hadoop-daemon.sh start datanode
This command also opens listening ports; check with:
# netstat -tnlp | grep java
tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 15093/java
tcp 0 0 0.0.0.0:50075 0.0.0.0:* LISTEN 15093/java
To verify:
[root@server hadoop]# jps
8177 Jps
8126 DataNode
7933 NameNode
Or open the web UI at http://127.0.0.1:50070, since 50070 is the NameNode management port.
From the CLI we can also see the cluster report:
[root@server hadoop]# hadoop dfsadmin -report
You can check the HDFS filesystem; initially there is nothing in it:
# hadoop fs -ls /
Create a directory in the HDFS filesystem:
# hadoop fs -mkdir /input
Upload (copy) a local file into the HDFS filesystem:
# hadoop fs -copyFromLocal test.txt /input
Note: the file is stored on the DataNode under its storage folder named "current", split into blocks of at most 64 MB, because the default block size is 64 MB.
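To see how the uploaded file was split into blocks and where the replicas were placed, you can run fsck against it (file name as uploaded above):
# hadoop fsck /input/test.txt -files -blocks -locations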
You can change the block size in hdfs-site.xml:
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<final>true</final>
</property>
By default each block is copied to 3 DataNodes, because the default replication factor is 3; you can change it in hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
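Note that changing dfs.replication only affects files written afterwards; for a file already in HDFS the replication factor can be changed with setrep (path assumed from the earlier upload):
# hadoop fs -setrep -w 2 /input/test.txt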
List files in HDFS:
# hadoop fs -ls /input
# hadoop fs -lsr /
How to Set Up MapReduce
Step 1:
Set up the mapred-site.xml file:
# vim /etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>192.168.0.16:9001</value>
</property>
</configuration>
Step 2: Start the JobTracker
# hadoop-daemon.sh start jobtracker
starting jobtracker, logging to /var/log/hadoop/root/hadoop-root-jobtracker-desktop16.example.com.out
# jps
7247 JobTracker
6467 DataNode
6541 NameNode
7325 Jps
Note: it starts 2 new ports; check:
# netstat -tnlp | grep java
tcp 0 0 0.0.0.0:50030 0.0.0.0:* LISTEN 7411/java
tcp 0 0 192.168.0.16:9001 0.0.0.0:* LISTEN 7411/java
Here 50030 is the management (web UI) port for MapReduce.
Check it: http://127.0.0.1:50030
Step 3: Start the TaskTracker
# hadoop-daemon.sh start tasktracker
starting tasktracker, logging to /var/log/hadoop/root/hadoop-root-tasktracker-desktop16.example.com.out
# jps
7639 Jps
7569 TaskTracker
6467 DataNode
7411 JobTracker
6541 NameNode
Step 4: Test your setup by running the example jar shipped in the Hadoop RPM.
You can locate it with:
# rpm -ql hadoop | grep examples
/usr/share/hadoop/hadoop-examples-1.2.1.jar
# hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar wordcount /input /output
14/05/07 14:38:01 INFO input.FileInputFormat: Total input paths to process : 1
14/05/07 14:38:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/05/07 14:38:01 WARN snappy.LoadSnappy: Snappy native library not loaded
14/05/07 14:38:02 INFO mapred.JobClient: Running job: job_201405071431_0001
14/05/07 14:38:03 INFO mapred.JobClient: map 0% reduce 0%
14/05/07 14:38:12 INFO mapred.JobClient: map 100% reduce 0%
14/05/07 14:38:20 INFO mapred.JobClient: map 100% reduce 33%
14/05/07 14:38:21 INFO mapred.JobClient: map 100% reduce 100%
14/05/07 14:38:22 INFO mapred.JobClient: Job complete: job_201405071431_0001
# hadoop job -list all
1 jobs submitted
States are:
Running : 1 Succeded : 2 Failed : 3 Prep : 4
JobId State StartTime UserName Priority SchedulingInfo
job_201405071431_0001 2 1399453681859 root NORMAL NA
# hadoop fs -ls /output
Found 3 items
-rw-r--r-- 3 root supergroup 0 2014-05-07 14:38 /output/_SUCCESS
drwxr-xr-x - root supergroup 0 2014-05-07 14:38 /output/_logs
-rw-r--r-- 3 root supergroup 34 2014-05-07 14:38 /output/part-r-00000
Note: the _SUCCESS file indicates the MapReduce job completed successfully.
Note: part-r-00000 contains the reducer output, i.e. the final result.
You can see the final output of the MapReduce job:
# hadoop fs -cat /output/part-r-00000
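For example, if test.txt contained the single line "hello hadoop hello hdfs", the reducer output would be word/count pairs, sorted by word:
hadoop 1
hdfs 1
hello 2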
We can also view it via the web UI:
http://127.0.0.1:50070 -> Browse the filesystem
If you want the complete details of a running or completed job, use the job ID with the -status option:
# hadoop job -status job_201405071431_0004
Job: job_201405071431_0004
file: hdfs://192.168.0.16:10001/tmp/hadoop-root/mapred/staging/root/.staging/job_201405071431_0004/job.xml
tracking URL: http://desktop16.example.com:50030/jobdetails.jsp?jobid=job_201405071431_0004
map() completion: 0.017579561
reduce() completion: 0.0
Counters: 3
Job Counters
SLOTS_MILLIS_MAPS=2481
Launched map tasks=2
Data-local map tasks=2
# hadoop job -list
1 jobs currently running
JobId State StartTime UserName Priority SchedulingInfo
job_201405071431_0004 1 1399486864042 root NORMAL NA
If you want to kill a running job:
# hadoop job -kill job_201405071431_0004
Killed job job_201405071431_0004
If you want to change the priority of one job relative to others:
# hadoop job -set-priority job_201405071431_0004 LOW
Changed job priority.
You can list jobs with the command below:
# hadoop job -list all
4 jobs submitted
States are:
Running : 1 Succeded : 2 Failed : 3 Prep : 4
JobId State StartTime UserName Priority SchedulingInfo
job_201405071431_0001 2 1399453681859 root NORMAL NA
job_201405071431_0002 3 1399462379102 root NORMAL NA
job_201405071431_0003 2 1399462502071 root NORMAL NA
job_201405071431_0004 2 1399486864042 root LOW NA
By default the FIFO scheduler is used in Apache Hadoop.
We can change to the Fair Scheduler in the mapred-site.xml file.
Step 1:
# vim /etc/hadoop/mapred-site.xml
Step 2:
In the JobTracker's mapred-site.xml, specify the scheduler to use:
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
Identify the pool configuration file:
<property>
<name>mapred.fairscheduler.allocation.file</name>
<value>/etc/hadoop/fair-scheduler.xml</value>
</property>
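Note: by default the Fair Scheduler groups jobs into pools named after the submitting user; for the -Dpool.name=tech syntax used in Step 5 below to select a pool, the pool-name property is typically also set in mapred-site.xml:
<property>
<name>mapred.fairscheduler.poolnameproperty</name>
<value>pool.name</value>
</property>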
Step 3:
# vim /etc/hadoop/fair-scheduler.xml
<allocations>
<pool name="tech">
<minMaps>10</minMaps>
<minReduces>5</minReduces>
<maxRunningJobs>2</maxRunningJobs>
</pool>
<pool name="hr">
<minMaps>10</minMaps>
<minReduces>5</minReduces>
</pool>
<user name="vimal">
<maxRunningJobs>2</maxRunningJobs>
</user>
</allocations>
Step 4:
# hadoop-daemon.sh stop jobtracker
stopping jobtracker
# hadoop-daemon.sh start jobtracker
starting jobtracker, logging to /var/log/hadoop/root/hadoop-root-jobtracker-desktop16.example.com.out
Step 5:
Run a job in the pool named "tech":
# hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar wordcount -Dpool.name=tech /input /output3
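Once the job is running, the Fair Scheduler status page on the JobTracker web UI shows the pools, their minimum shares and running jobs (assuming the scheduler loaded correctly):
http://127.0.0.1:50030/scheduler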