
Setting up Hadoop in pseudo-distributed mode in Ubuntu

I’m trying to teach myself Hadoop on my laptop. My objective is to get pseudo-distributed mode running.

I’m following the guide from the Apache website to set up Hadoop and HDFS in Ubuntu, but I can’t get it to work. Here are the steps I have followed so far:

1) check Java version:

sudo apt-get update
sudo apt-get install default-jdk
java -version

returns:

openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)

2) obtain Hadoop 2.7:

cd /home/me/Downloads
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar zxf hadoop-2.7.3.tar.gz
mv hadoop-2.7.3 /home/me

3) link Hadoop to Java.

In /home/me/hadoop-2.7.3/etc/hadoop/hadoop-env.sh (opened with gedit), replace

export JAVA_HOME=${JAVA_HOME}

with

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
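
Before editing, it can be worth checking what that expression resolves to (just a sanity check, not part of the guide):

readlink -f /usr/bin/java | sed "s:bin/java::"
# with the default OpenJDK 8 packages this typically prints something like
# /usr/lib/jvm/java-8-openjdk-amd64/jre/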

4) install SSH and rsync:

sudo apt-get install openssh-server
sudo apt-get install ssh
sudo apt-get install rsync
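
The Apache guide also sets up passwordless SSH to localhost at this point; without it, start-dfs.sh keeps prompting for passwords (as happens below). Something along these lines:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost    # should now log in without asking for a password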

5) add /home/me/hadoop-2.7.3/bin and /home/me/hadoop-2.7.3/sbin to the PATH:

cd 
gedit .bashrc

and add:

export PATH=$PATH:/home/me/hadoop-2.7.3/bin
export PATH=$PATH:/home/me/hadoop-2.7.3/sbin

then reload it:

source .bashrc
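
To confirm that the new PATH entries are picked up (assuming the shell has re-read .bashrc):

which hdfs        # should print /home/me/hadoop-2.7.3/bin/hdfs
hadoop version    # should report Hadoop 2.7.3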

6) Now, I’m trying to set up the Pseudo-Distributed Operation mode. Still following the instructions, I change /home/me/hadoop-2.7.3/etc/hadoop/core-site.xml by adding

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>

in the <configuration> block and I change /home/me/hadoop-2.7.3/etc/hadoop/hdfs-site.xml by adding

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

7) Following the instructions, doing:

hdfs namenode -format

seems to work (it yields a Y/N prompt and a lot of text on the screen).
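
As a side note (this is an assumption based on Hadoop's defaults, not something from the guide): since nothing has set dfs.namenode.name.dir yet, the format writes its metadata under hadoop.tmp.dir, which defaults to /tmp/hadoop-<user>. That can be checked with something like:

ls /tmp/hadoop-$USER/dfs/name/current
# after a successful format this should contain a VERSION file and an fsimage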

8) start HDFS:

start-dfs.sh

also seems to work (it prompts for a couple of passwords).
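
A useful check at this point (not in the original steps) is jps, which ships with the JDK and lists the running Java processes; after a healthy start-dfs.sh in pseudo-distributed mode there should be a NameNode, a DataNode and a SecondaryNameNode:

jps
# expected output (PIDs will differ):
# 12345 NameNode
# 12346 DataNode
# 12347 SecondaryNameNode
# 12348 Jps
# if DataNode is missing here, the -put below fails exactly as shown.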

9) Create the folder structure for input. Doing

hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/hduser/
hdfs dfs -mkdir /user/hduser/input/

works. But now, doing

hdfs dfs -put /home/me/Desktop/work/cv/hadoop/salaries.csv /user/hduser/input/

yields:

16/12/12 14:53:14 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hduser/input/salaries.csv._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1).  There are 0 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1571)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:725)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

    at org.apache.hadoop.ipc.Client.call(Client.java:1475)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy11.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1455)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1251)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
put: File /user/hduser/input/salaries.csv._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1).  There are 0 datanode(s) running and no node(s) are excluded in this operation.

and

hdfs dfs -ls /user/hduser/input

doesn’t show anything. :(
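
The key part of that error is "There are 0 datanode(s) running". Two ways to confirm that the datanode really isn't up (the log path is an assumption based on the default log directory; the actual file name contains the user and hostname):

hdfs dfsadmin -report
# "Live datanodes (0)" in the report confirms no datanode has registered with the namenode

tail -n 50 /home/me/hadoop-2.7.3/logs/hadoop-*-datanode-*.log
# the last lines of the datanode log usually say why it failed to start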

Edit:

After reading Arduino Sentinel’s answer, my hdfs-site.xml file is:

<configuration>
    <property>
       <name>dfs.namenode.name.dir</name>
       <value>/home/me/Desktop/work/cv/hadoop/namenode</value>
    </property>
    <property>
       <name>dfs.datanode.data.dir</name>
       <value>/home/me/Desktop/work/cv/hadoop/datanode</value>
    </property>
</configuration>

and both /home/me/Desktop/work/cv/hadoop/datanode and /home/me/Desktop/work/cv/hadoop/namenode exist.

I make sure that /home/me/Desktop/work/cv/hadoop/datanode and /home/me/Desktop/work/cv/hadoop/namenode are empty:

rm -rf  /home/me/Desktop/work/cv/hadoop/namenode/*
rm -rf  /home/me/Desktop/work/cv/hadoop/datanode/*
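
(Presumably, after clearing those directories, the namenode also has to be re-formatted and HDFS restarted, and the /user/hduser/input directories recreated, since formatting wipes the HDFS namespace; roughly:)

stop-dfs.sh
hdfs namenode -format
start-dfs.sh
hdfs dfs -mkdir -p /user/hduser/input    # the namespace is empty again after the format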

and now doing

hdfs dfs -put /home/me/Desktop/work/cv/hadoop/salaries.csv /user/hduser/input/

does not return an error message and doing:

hdfs dfs -ls /user/hduser/input

yields the desired result:

Found 1 items
-rw-r--r--   3 me supergroup    1771685 2016-12-20 12:23 /user/hduser/input/salaries.csv


Answer

Your hdfs-site.xml should have dfs.namenode.name.dir and dfs.datanode.data.dir properties that point to local directories in order for the namenode and datanode to start.

<property>
    <name>dfs.namenode.name.dir</name>
    <value>/<local-dir path>/namenode</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/<local-dir path>/datanode</value>
</property>