
How to let Apache Spark on Windows access Hadoop on Linux?

First, I have almost no experience with Apache Hadoop and Apache Spark.

What I want for now is as follows:

  • Hadoop is running on Hortonworks Sandbox 2.1, which is installed on a Windows 7 machine.
  • Spark shell and Spark programs are running on the Windows 7 machine, which is the same machine as above.
  • Spark shell and Spark programs can access the Hadoop instance running on the Hortonworks Sandbox 2.1.

The reason I want to run Spark on Windows is that I want to develop Spark programs on the Windows 7 machine, an environment I am familiar with.
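
To make that concrete, below is roughly the kind of program I want to run from the Windows machine. This is only a sketch: the object name, the sandbox hostname (sandbox.hortonworks.com), the HDFS port (8020), and the input path are placeholders, not verified values from my setup.

    import org.apache.spark.{SparkConf, SparkContext}
    // Needed for reduceByKey and other pair-RDD operations on pre-1.3 Spark.
    import org.apache.spark.SparkContext._

    object WordCountOnSandbox {
      def main(args: Array[String]): Unit = {
        // Run Spark itself locally on the Windows 7 machine, with 2 worker threads.
        val conf = new SparkConf().setAppName("WordCountOnSandbox").setMaster("local[2]")
        val sc = new SparkContext(conf)

        // Read input from HDFS on the Hortonworks Sandbox. The hostname and
        // port are placeholders; use whatever the VM actually exposes.
        val lines = sc.textFile("hdfs://sandbox.hortonworks.com:8020/user/hue/input.txt")

        // A classic word count, just to exercise the HDFS connection.
        val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
        counts.take(10).foreach(println)

        sc.stop()
      }
    }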

Installing Hortonworks Sandbox 2.1 was straightforward.

Then I tried to build Spark on the Windows 7 machine as follows:

  • Install JDK 7, Git, and sbt (JDK 8 does not work with sbt, at least as of 2014-05-08)
  • git clone git://github.com/apache/spark.git (spark-0.9.1 does not compile with Hadoop 2.4.0 when SPARK_YARN=true, as described in https://issues.apache.org/jira/browse/SPARK-1465)
  • Run cmd
  • cd to spark root directory
  • set SPARK_HADOOP_VERSION=2.4.0 (Hadoop on Hortonworks Sandbox 2.1 is 2.4.0.2.1.1.0-385)
  • set SPARK_YARN=true
  • sbt assembly

Done. Successful.
But when I ran bin\spark-shell, the following warning and error appeared.

14/05/08 11:26:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/08 11:26:15 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
    ...

Apparently the Spark shell requires winutils.exe.
I found that winutils.exe is part of Hadoop, but it is not included in the normal prebuilt distribution.
So I built Hadoop on Windows.
With some trouble, I managed to build hadoop-2.4.0.tar.gz, which includes winutils.exe and hadoop.dll.

But… I don’t know what to do from here.
How should I install or apply the Hadoop package I built so that Spark can use it to access the Hadoop instance on the Hortonworks Sandbox 2.1?

Any suggestions are welcome.


Answer

I’ve managed to successfully set up the above configuration with Spark 1.0.0.

It’s a somewhat long story, but most of the problems were configuration-related. Perhaps an experienced Spark + Hadoop developer would have no trouble, except for the one I describe below.
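
For reference, the configuration that mattered most on the Windows side boils down to two settings. Hadoop’s Shell class resolves winutils.exe from the hadoop.home.dir system property, falling back to the HADOOP_HOME environment variable, so pointing either one at the directory containing the built binaries fixes the error from the question. Here is a minimal sketch, assuming (hypothetically) that the hadoop-2.4.0.tar.gz from the question was unpacked to C:\hadoop (so that C:\hadoop\bin\winutils.exe exists) and that the sandbox’s namenode answers at sandbox.hortonworks.com:8020 (both placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object SandboxSmokeTest {
      def main(args: Array[String]): Unit = {
        // Must be set before the first Hadoop class is loaded; this is what
        // makes the "Could not locate executable ...\winutils.exe" error go away.
        // C:\hadoop is an example path: it just has to contain bin\winutils.exe.
        System.setProperty("hadoop.home.dir", "C:\\hadoop")

        val sc = new SparkContext(
          new SparkConf().setAppName("SandboxSmokeTest").setMaster("local[*]"))

        // Point the Hadoop client at the sandbox's namenode. The hostname and
        // port are placeholders for whatever the VM actually exposes.
        sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://sandbox.hortonworks.com:8020")

        // Unqualified paths now resolve against the sandbox's HDFS.
        val lines = sc.textFile("/tmp/sample.txt") // example path
        println("lines in /tmp/sample.txt: " + lines.count())

        sc.stop()
      }
    }

The separate NativeCodeLoader warning concerns hadoop.dll; adding C:\hadoop\bin to PATH (or to java.library.path) should silence it, though in my experience it is harmless for this kind of setup.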
Also, the question above was about Spark 0.9.1, which is now out of date, so answering it in detail would not be very useful.

But one problem is a cross-platform issue that still applies to Spark 1.0.0.
I’ve created a pull request for it: https://github.com/apache/spark/pull/899
If you’re interested, follow the link.

UPDATE: The above cross-platform issue was resolved in version 1.3.0.
