I’m trying to teach myself Hadoop on my laptop. My objective is to get pseudo-distributed mode running.
I’m following the guide from the Apache website to set up Hadoop and HDFS in Ubuntu, but I can’t get it to work. Here are the steps I have followed so far:
1) check Java version:
sudo apt-get update
sudo apt-get install default-jdk
java -version
returns:
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
2) obtain Hadoop 2.7:
cd /home/me/Downloads
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar zxf hadoop-2.7.3.tar.gz
mv hadoop-2.7.3 /home/me
3) link Hadoop to Java:
replace
export JAVA_HOME=${JAVA_HOME}
by
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
in
gedit /home/me/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
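To sanity-check what that command substitution resolves to before saving the file, it can be run on its own:
readlink -f /usr/bin/java | sed "s:bin/java::"
On a stock Ubuntu OpenJDK 8 install this should print something like /usr/lib/jvm/java-8-openjdk-amd64/jre/ (the exact path may differ).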
4) add SSH:
sudo apt-get install openssh-server
sudo apt-get install ssh
sudo apt-get install rsync
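One thing worth adding here: the Apache guide also sets up passphraseless SSH to localhost at this point, which is what later lets start-dfs.sh launch the daemons without asking for passwords (see step 8). Its commands are roughly:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost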
5) add /home/me/hadoop-2.7.3/bin
and /home/me/hadoop-2.7.3/sbin
to the PATH:
cd
gedit .bashrc
and add:
export PATH=$PATH:/home/me/hadoop-2.7.3/bin
export PATH=$PATH:/home/me/hadoop-2.7.3/sbin
then reload it:
source .bashrc
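A quick way to confirm the PATH change took effect in the new shell:
which hdfs
hadoop version
The first should point into /home/me/hadoop-2.7.3/bin and the second should print the 2.7.3 version banner.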
6) Now, I’m trying to set up the Pseudo-Distributed Operation mode. Still
following the instructions, I change /home/me/hadoop-2.7.3/etc/hadoop/core-site.xml
by adding
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>
in the <configuration>
block and I change /home/me/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
by adding
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
7) Following the instructions, doing:
hdfs namenode -format
seems to work (it asks for a Y/N confirmation and prints a lot of output).
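(Aside: the Y/N prompt appears when re-formatting an existing metadata directory; per the namenode usage, it can presumably be suppressed with hdfs namenode -format -force if needed.)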
8) start HDFS:
start-dfs.sh
also seems to work (it prompts for a couple of SSH passwords).
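To see which daemons actually came up, jps (shipped with the JDK) is handy; a healthy pseudo-distributed setup should list something like:
jps
2881 NameNode
3018 DataNode
3210 SecondaryNameNode
3312 Jps
(the PIDs are arbitrary). If DataNode is missing from this list, HDFS writes will fail later.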
9) create the folder structure for input. Doing
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/hduser/
hdfs dfs -mkdir /user/hduser/input/
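As an aside, the same structure can be created in a single call, since -mkdir accepts a -p flag that creates parent directories as needed:
hdfs dfs -mkdir -p /user/hduser/input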
These mkdir commands work. But now, doing
hdfs dfs -put /home/me/Desktop/work/cv/hadoop/salaries.csv /user/hduser/input/
yields:
16/12/12 14:53:14 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hduser/input/salaries.csv._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1571)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:725)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
    at org.apache.hadoop.ipc.Client.call(Client.java:1475)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy11.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1455)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1251)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
put: File /user/hduser/input/salaries.csv._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
and
hdfs dfs -ls /user/hduser/input
doesn’t show anything :(
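Since the exception says there are 0 datanodes running, the natural diagnostics (log path assuming the tarball layout above) are:
hdfs dfsadmin -report
tail -n 50 /home/me/hadoop-2.7.3/logs/hadoop-*-datanode-*.log
The first shows how many datanodes the namenode can see; the second usually contains the reason the datanode process died.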
Edit:
After reading Arduino Sentinel’s answer, my hdfs-site.xml
file is:
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/me/Desktop/work/cv/hadoop/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/me/Desktop/work/cv/hadoop/datanode</value>
    </property>
</configuration>
and both /home/me/Desktop/work/cv/hadoop/datanode
and /home/me/Desktop/work/cv/hadoop/namenode
exist.
I then make sure that /home/me/Desktop/work/cv/hadoop/datanode
and /home/me/Desktop/work/cv/hadoop/namenode
are empty:
rm -rf /home/me/Desktop/work/cv/hadoop/namenode/*
rm -rf /home/me/Desktop/work/cv/hadoop/datanode/*
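(One step appears to be missing from this edit: after wiping the namenode directory, HDFS presumably has to be stopped, re-formatted, and restarted, along the lines of
stop-dfs.sh
hdfs namenode -format
start-dfs.sh
since the metadata the daemons need was just deleted.)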
and now doing
hdfs dfs -put /home/me/Desktop/work/cv/hadoop/salaries.csv /user/hduser/input/
does not return an error message and doing:
hdfs dfs -ls /user/hduser/input
yields the desired result:
Found 1 items
-rw-r--r--   3 me supergroup    1771685 2016-12-20 12:23 /user/hduser/input/salaries.csv
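As a final check, the uploaded file can be read back out of HDFS:
hdfs dfs -cat /user/hduser/input/salaries.csv | head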
Answer
Your hdfs-site.xml should have dfs.namenode.name.dir and dfs.datanode.data.dir properties that point to local directories, in order for the namenode and datanode to start:
<property>
    <name>dfs.namenode.name.dir</name>
    <value>/<local-dir path>/namenode</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/<local-dir path>/datanode</value>
</property>
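Two practical notes on top of this: the directories must exist and be writable by the user running Hadoop before the daemons start, e.g. (using the question’s paths):
mkdir -p /home/me/Desktop/work/cv/hadoop/namenode /home/me/Desktop/work/cv/hadoop/datanode
and after changing these properties the namenode generally needs a fresh hdfs namenode -format followed by a daemon restart before the datanode will register.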