10 December 2014

Maven Config For Cloudera Map Reduce Programs

I have been working with Hadoop for a while now, and I have been able to achieve everything I need with a combination of Sqoop, Oozie, Hive and shell scripts. Given a bit of free time, I decided it would be worth exploring how to create simple map reduce jobs in Java.

First install Maven (Java build tool and dependency manager) and Eclipse (Java IDE) on your development machine.

Then create a new Maven project using Eclipse or from the command line:

$ mvn archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.sodonnel.Hadoop -DartifactId=WordCount

This will create a directory called WordCount, and inside it you will find a Java project structure and a pom.xml file, which is the Maven config file.

As we want to create a Hadoop Map Reduce program, we need to add the Hadoop dependencies to our project. Searching the web, people seem to put all sorts of dependencies into their pom.xml for Hadoop jobs, but I found I only needed a few entries: one to specify the Cloudera Maven repo, another to bring in the Hadoop dependencies, and a couple more to allow me to write unit tests against map reduce jobs. My complete pom.xml is:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.sodonnel.Hadoop</groupId>
  <artifactId>WordCount</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>WordCount</name>
  <url>http://maven.apache.org</url>

  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>

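    <!-- MRUnit is only needed for unit tests, so limit it to the test classpath -->
    <dependency>
      <groupId>org.apache.mrunit</groupId>
      <artifactId>mrunit</artifactId>
      <version>1.1.0</version>
      <classifier>hadoop2</classifier>
      <scope>test</scope>
    </dependency>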

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.5.0-cdh5.2.1</version>
    </dependency>

  </dependencies>
</project>

There are a couple of things to watch out for.

Cloudera have a different version of the hadoop-client library for Map Reduce V1 and YARN clusters, and the one you get is determined by the version string you put in the dependency:

2.5.0-cdh5.2.1 for YARN
2.5.0-mr1-cdh5.2.1 for Map Reduce V1

The second thing to be aware of is picking the correct version for your Cloudera cluster; this link is useful to figure that out.
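
Both of these points come down to a single version string, so one option is to keep it in a property. The fragment below is only a sketch and is not part of the pom.xml above: the property name hadoop.version is an assumption chosen for illustration, and the two values are the ones listed earlier.

```xml
<!-- Sketch only: hadoop.version is an assumed property name, not one Maven or Cloudera defines -->
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <!-- YARN cluster: 2.5.0-cdh5.2.1; Map Reduce V1 cluster: 2.5.0-mr1-cdh5.2.1 -->
  <hadoop.version>2.5.0-cdh5.2.1</hadoop.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <!-- Resolved from the property above -->
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>
```

Moving to a different cluster type or CDH release then means changing only the property value, or overriding it on the command line with -Dhadoop.version=2.5.0-mr1-cdh5.2.1.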

Download Project Dependencies and Compile

At this point, you will want to install all the project dependencies into your local Maven repo:

$ mvn clean install

This should pull down all the packages required to compile and run your application. The command also compiles the application and builds it into a JAR, ready for execution. We have not added any source files to the project yet, but Maven put a 'hello world' class in for us. Inside the target directory of your project, you should find a file called WordCount-1.0-SNAPSHOT.jar, which is the compiled application.
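
The JAR name is built from the artifactId and version in the pom.xml, since Maven's default final name is ${project.artifactId}-${project.version}. If a fixed name is more convenient, for example in deployment scripts, a fragment along these lines could be added inside the project element. The name wordcount is purely hypothetical.

```xml
<!-- Sketch only: "wordcount" is a hypothetical name chosen for illustration -->
<build>
  <!-- Overrides the default ${project.artifactId}-${project.version} -->
  <finalName>wordcount</finalName>
</build>
```

With this in place, the build should write target/wordcount.jar rather than target/WordCount-1.0-SNAPSHOT.jar. The copy installed into the local Maven repo keeps the standard artifactId and version naming.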

Adding To Eclipse

If you created the project from the command line and want to import it into Eclipse, then run the following command to generate the Eclipse project files:

$ mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true

This ran for quite a while on my system as the Javadoc files were quite large. Finally, import the project into Eclipse using File > Import, then expand Maven and select 'Existing Maven Projects'.

My next article will have some information on creating a simple map reduce job and running it against Hadoop.