
How To Set Up a Single-Node Big Data Hadoop Cluster

How to Set Up a Single-Node (Pseudo-Distributed) Hadoop Cluster

Step 1:

Get the Hadoop RPM from the Apache site; search Google for "apache hadoop download"

http://www.apache.org/dyn/closer.cgi/hadoop/common/

In the LinuxWorld lab, run:

# yum install hadoop

Step 2:

Get the Java (JDK) RPM from the Oracle site; search Google for "jdk download"

http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

In the LinuxWorld lab, run:

# yum install jdk

Step 3: Verify the JDK installation and point JAVA_HOME at it

[root@server Desktop]# rpm -ql jdk | grep java$

/etc/.java

/usr/java

/usr/java/jdk1.7.0_51/bin/java

/usr/java/jdk1.7.0_51/jre/bin/java

[root@server Desktop]# /usr/java/jdk1.7.0_51/bin/java  -version

java version "1.7.0_51"

Java(TM) SE Runtime Environment (build 1.7.0_51-b13)

Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

[root@server Desktop]# java -version

java version "1.7.0_09-icedtea"

OpenJDK Runtime Environment (rhel-2.3.4.1.el6_3-x86_64)

OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)

[root@server Desktop]# echo $JAVA_HOME

/usr

[root@server Desktop]# JAVA_HOME=/usr/java/jdk1.7.0_51/

[root@server Desktop]# echo $JAVA_HOME

/usr/java/jdk1.7.0_51/

[root@server Desktop]# java -version

java version "1.7.0_09-icedtea"

OpenJDK Runtime Environment (rhel-2.3.4.1.el6_3-x86_64)

OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)

[root@server Desktop]# PATH=$JAVA_HOME/bin:$PATH

Note: $JAVA_HOME/bin must come before $PATH in the command above, so that the Oracle JDK's java is found before the system OpenJDK.

[root@server Desktop]# java -version

java version "1.7.0_51"

Java(TM) SE Runtime Environment (build 1.7.0_51-b13)

Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

Step 4: Make the Java environment settings permanent

[root@server Desktop]# vim /root/.bash_profile

export JAVA_HOME=/usr/java/jdk1.7.0_51/

PATH=$JAVA_HOME/bin:$PATH

[root@server Desktop]# . /root/.bash_profile
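To confirm the change persists in a new shell, a quick re-check (the version reported should now match the Oracle JDK from Step 3, and which should point under /usr/java/jdk1.7.0_51):

[root@server Desktop]# which java

[root@server Desktop]# java -version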

Step 5:

Hadoop is already set up by the RPM, but we need to configure the Java path in its environment file.

[root@localhost /]# vim /etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.7.0_51/

# The maximum amount of heap to use, in MB. Default is 1000.

export HADOOP_HEAPSIZE=500

To test that it is working, run the command below:

[root@localhost /]# hadoop fs -ls /
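As an additional sanity check (not part of the original lab steps), you can also print the Hadoop version that the RPM installed:

[root@localhost /]# hadoop version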

Step 6: Setup HDFS name and data node

[root@server hadoop]# vim /etc/hadoop/hdfs-site.xml

<configuration>

<property>

<name>dfs.name.dir</name>

<value>/data/nodename</value>

<final>true</final>

</property>

<property>

<name>dfs.data.dir</name>

<value>/data/dataname</value>

<final>true</final>

</property>

<property>

<name>dfs.replication</name>

<value>3</value>

<final>true</final>

</property>

<property>

<name>dfs.block.size</name>

<value>134217728</value>

<final>true</final>

</property>

</configuration>

Note: the directories above are created automatically; there is no need to create them beforehand.
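For reference, dfs.block.size is given in bytes: 134217728 = 128 × 1024 × 1024, i.e. a 128 MB block size.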

[root@server hadoop]# hadoop namenode -format
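After a successful format, the NameNode metadata directory should have been populated; a quick check (path taken from the dfs.name.dir value above):

[root@server hadoop]# ls /data/nodename/current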

Step 7: To start name and data node

[root@server hadoop]# vim /etc/hadoop/core-site.xml

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://127.0.0.1:10001</value>

</property>

<property>

<name>hadoop.tmp.dir</name>

<value>/usr/local/hadoop/tmp</value>

</property>

</configuration>

[root@server hadoop]# hadoop-daemon.sh  start namenode

The command above opens some listening ports; check with:

# netstat -tnlp | grep java

tcp        0      0 127.0.0.1:10001             0.0.0.0:*                   LISTEN      14969/java

tcp        0      0 0.0.0.0:50070               0.0.0.0:*                   LISTEN      14969/java

[root@server hadoop]# hadoop-daemon.sh  start datanode

The command above opens some listening ports; check with:

# netstat -tnlp | grep java

tcp        0      0 0.0.0.0:50010               0.0.0.0:*                   LISTEN      15093/java

tcp        0      0 0.0.0.0:50075               0.0.0.0:*                   LISTEN      15093/java

To verify:

[root@server hadoop]# jps

8177 Jps

8126 DataNode

7933 NameNode

Or go to the URL below, as 50070 is the NameNode management (web UI) port:

http://127.0.0.1:50070 

From the CLI, we can also see the report:

[root@server hadoop]# hadoop dfsadmin -report
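Right after startup the NameNode may still be in safe mode (read-only); if the report looks odd, you can check that with:

[root@server hadoop]# hadoop dfsadmin -safemode get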

You can check the Hadoop HDFS filesystem; initially there is nothing in it:

# hadoop fs -ls /
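The bare path form above relies on fs.default.name from core-site.xml; the fully qualified form is equivalent (URI taken from that config):

# hadoop fs -ls hdfs://127.0.0.1:10001/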

Create a directory in the HDFS filesystem:

# hadoop fs -mkdir /input

Upload (copy) a local file into the HDFS filesystem:

# hadoop fs -copyFromLocal  test.txt  /input

Note: the file is stored on the datanode under the storage directory named "current", split into blocks in a distributed fashion. The default block size is 64 MB (this setup overrides it to 128 MB in hdfs-site.xml).

You can change the block size in hdfs-site.xml:

<property>

<name>dfs.block.size</name>

<value>134217728</value>

<final>true</final>

</property>
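To see how a particular file was actually split into blocks and where the replicas are, fsck can report it (shown here for the test.txt uploaded earlier):

# hadoop fsck /input/test.txt -files -blocks -locations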

By default it is copied to 3 datanodes, because the default replication factor is 3; you can change it in hdfs-site.xml:

<property>

<name>dfs.replication</name>

<value>2</value>

</property>
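Note that editing dfs.replication only affects files written afterwards; for a file already in HDFS you can change its replication with setrep (a minimal example using the earlier upload):

# hadoop fs -setrep 2 /input/test.txt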

List files in HDFS:

# hadoop fs -ls /input

# hadoop fs -lsr /

How to Set Up MapReduce

Step 1:

Set up the mapred-site.xml file:

# vim /etc/hadoop/mapred-site.xml

<configuration>

<property>

<name>mapred.job.tracker</name>

<value>192.168.0.16:9001</value>

</property>

</configuration>

Step 2: Start the JobTracker

# hadoop-daemon.sh  start jobtracker

starting jobtracker, logging to /var/log/hadoop/root/hadoop-root-jobtracker-desktop16.example.com.out

# jps

7247 JobTracker

6467 DataNode

6541 NameNode

7325 Jps

Note: it starts 2 new ports; check with:

# netstat -tnlp | grep java

tcp        0      0 0.0.0.0:50030               0.0.0.0:*                   LISTEN      7411/java

tcp        0      0 192.168.0.16:9001           0.0.0.0:*                   LISTEN      7411/java

Here, 50030 is the management (web UI) port for MapReduce.

Check it: http://127.0.0.1:50030

Step 3: Start the TaskTracker

# hadoop-daemon.sh  start tasktracker

starting tasktracker, logging to /var/log/hadoop/root/hadoop-root-tasktracker-desktop16.example.com.out

# jps

7639 Jps

7569 TaskTracker

6467 DataNode

7411 JobTracker

6541 NameNode
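The TaskTracker also opens its own ports (50060 is its web UI); you can confirm with the same netstat check as before:

# netstat -tnlp | grep java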

Step 4: Test your setup by running an example from the Hadoop RPM

You can locate it with:

# rpm -ql hadoop | grep examples

/usr/share/hadoop/hadoop-examples-1.2.1.jar

# hadoop jar  /usr/share/hadoop/hadoop-examples-1.2.1.jar   wordcount /input  /output

14/05/07 14:38:01 INFO input.FileInputFormat: Total input paths to process : 1

14/05/07 14:38:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library

14/05/07 14:38:01 WARN snappy.LoadSnappy: Snappy native library not loaded

14/05/07 14:38:02 INFO mapred.JobClient: Running job: job_201405071431_0001

14/05/07 14:38:03 INFO mapred.JobClient:  map 0% reduce 0%

14/05/07 14:38:12 INFO mapred.JobClient:  map 100% reduce 0%

14/05/07 14:38:20 INFO mapred.JobClient:  map 100% reduce 33%

14/05/07 14:38:21 INFO mapred.JobClient:  map 100% reduce 100%

14/05/07 14:38:22 INFO mapred.JobClient: Job complete: job_201405071431_0001

# hadoop job -list all

1 jobs submitted

States are:

Running : 1     Succeded : 2    Failed : 3      Prep : 4

JobId   State   StartTime       UserName        Priority        SchedulingInfo

job_201405071431_0001   2       1399453681859   root    NORMAL  NA

# hadoop fs -ls /output

Found 3 items

-rw-r--r--   3 root supergroup          0 2014-05-07 14:38 /output/_SUCCESS

drwxr-xr-x   - root supergroup          0 2014-05-07 14:38 /output/_logs

-rw-r--r--   3 root supergroup         34 2014-05-07 14:38 /output/part-r-00000

Note: the _SUCCESS file being created means the MapReduce job completed successfully.

Note: part-r-00000 contains the reducer's output, i.e. the final output.

You can see the final output of the MapReduce job:

# hadoop fs -cat /output/part-r-00000
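If you want the result on the local filesystem rather than only printed to the terminal, you can copy it out (destination path is just an example):

# hadoop fs -copyToLocal /output/part-r-00000 /tmp/wordcount-output.txt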

We can also see it through the web UI:

http://127.0.0.1:50070  -> Browse the filesystem

If you want to list the complete details of a running or completed job, use the job ID with the -status option:

# hadoop job -status  job_201405071431_0004

Job: job_201405071431_0004

file: hdfs://192.168.0.16:10001/tmp/hadoop-root/mapred/staging/root/.staging/job_201405071431_0004/job.xml

tracking URL: http://desktop16.example.com:50030/jobdetails.jsp?jobid=job_201405071431_0004

map() completion: 0.017579561

reduce() completion: 0.0

Counters: 3

Job Counters

SLOTS_MILLIS_MAPS=2481

Launched map tasks=2

Data-local map tasks=2

# hadoop job -list

1 jobs currently running

JobId   State   StartTime       UserName        Priority        SchedulingInfo

job_201405071431_0004   1       1399486864042   root    NORMAL  NA

If you want to kill a running job:

# hadoop job -kill   job_201405071431_0004

Killed job job_201405071431_0004

If you want to change the priority of one job relative to others:

# hadoop job -set-priority   job_201405071431_0004 LOW

Changed job priority.
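The priority values accepted here are VERY_HIGH, HIGH, NORMAL, LOW and VERY_LOW.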

You can list jobs with the command below:

# hadoop job -list all

4 jobs submitted

States are:

Running : 1     Succeded : 2    Failed : 3      Prep : 4

JobId   State   StartTime       UserName        Priority        SchedulingInfo

job_201405071431_0001   2       1399453681859   root    NORMAL  NA

job_201405071431_0002   3       1399462379102   root    NORMAL  NA

job_201405071431_0003   2       1399462502071   root    NORMAL  NA

job_201405071431_0004   2       1399486864042   root    LOW     NA

 

By default, the FIFO scheduler is used in Apache Hadoop.

We can change to the Fair Scheduler in the mapred-site.xml file.

Step 1:

# vim /etc/hadoop/mapred-site.xml

Step 2:

In the JobTracker's mapred-site.xml, specify the scheduler to use:

<property>

<name>mapred.jobtracker.taskScheduler</name>

<value>org.apache.hadoop.mapred.FairScheduler</value>

</property>

Identify the pool configuration file:

<property>

<name>mapred.fairscheduler.allocation.file</name>

<value>/etc/hadoop/fair-scheduler.xml</value>

</property>

Step 3:

# vim /etc/hadoop/fair-scheduler.xml

<allocations>

<pool name="tech">

<minMaps>10</minMaps>

<minReduces>5</minReduces>

<maxRunningJobs>2</maxRunningJobs>

</pool>

<pool name="hr">

<minMaps>10</minMaps>

<minReduces>5</minReduces>

</pool>

<user name="vimal">

<maxRunningJobs>2</maxRunningJobs>

</user>

</allocations>

Step 4:

# hadoop-daemon.sh  stop jobtracker

stopping jobtracker

# hadoop-daemon.sh  start jobtracker

starting jobtracker, logging to /var/log/hadoop/root/hadoop-root-jobtracker-desktop16.example.com.out

Step 5:

Run a job with the pool name "tech":

# hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar  wordcount  -Dpool.name=tech  /input /output3
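To verify which pool a running job landed in, the Fair Scheduler adds its own page to the JobTracker web UI (same 50030 port as before), which should be reachable at:

http://127.0.0.1:50030/scheduler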

