06 January 2015

Unit Testing Map Reduce Programs With MRUnit

Last time, I described how to create a very simple map reduce program in Java. The next problem you run into is how to write some unit tests for this program.

The nice thing about testing most Map Reduce programs is that each stage of the process is generally very simple. For instance, a mapper receives a line of a file and transforms it into a set of key value pairs.

This means the mapper can be tested in isolation from the reducer, and even a job that runs through many map and reduce phases can be tested one stage at a time.
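As a reminder, the classes under test here are the WordCountMapper and WordCountReducer built in the last article. The real code is in that post, but a rough sketch of the shape the tests below assume looks something like this (comma separated input, one count per word; both classes sit in the com.sodonnel.Hadoop package and need the usual org.apache.hadoop.io and org.apache.hadoop.mapreduce imports):

// WordCountMapper.java - splits each comma separated line into words
// and emits (word, 1) for every word found
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split(",")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, one);
            }
        }
    }
}

// WordCountReducer.java - sums the counts emitted for each word
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}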

To make testing Map Reduce programs easier, Apache provides a library called MRUnit. MRUnit is based on JUnit, so its syntax should be pretty familiar.

To include MRUnit in your project, add the following to your pom.xml:

<dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>1.1.0</version>
    <classifier>hadoop2</classifier>
    <scope>test</scope>
</dependency>

Building on the code from my last article, it's pretty simple to create some unit tests for the mapper:

package com.sodonnel.Hadoop;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver; 
import org.junit.*;

public class WordCountMapperTest {

    MapDriver<Object, Text, Text, IntWritable> mapDriver;

    @Before
    public void setup() {
        WordCountMapper mapper = new WordCountMapper();
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void splitValidRecordIntoTokens() throws IOException, InterruptedException {
        Text value = new Text("the,quick,brown,fox,the");
        mapDriver.withInput(new LongWritable(), value)
                .withOutput(new Text("the"), new IntWritable(1)) 
                .withOutput(new Text("quick"), new IntWritable(1)) 
                .withOutput(new Text("brown"), new IntWritable(1)) 
                .withOutput(new Text("fox"), new IntWritable(1)) 
                .withOutput(new Text("the"), new IntWritable(1)) 
                .runTest();
    } 

    @Test
    public void recordWithSingleWordIsValid() throws IOException, InterruptedException {
        Text value = new Text("the");
        mapDriver.withInput(new LongWritable(), value)
                .withOutput(new Text("the"), new IntWritable(1)) 
                .runTest();
    }

    @Test
    public void recordWithEmptyLineOutputsNothing() throws IOException, InterruptedException {
        Text value = new Text("");
        // If you don't specify any 'withOutput' lines, then it EXPECTs no output. 
        // If there is output it will fail the test
        mapDriver.withInput(new LongWritable(), value)
                .runTest();
    }
}

Notice that the setup method creates a MapDriver object, which is parameterized with the same input and output key and value types as the Mapper class under test.

Then, using the MapDriver, you define the input and any expected output, and run the test - simple.
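If you would rather make your own assertions instead of letting runTest compare the output for you, the drivers also have a run method that returns the output as a list of key value pairs. A quick sketch of how that might look (it needs java.util.List, org.apache.hadoop.mrunit.types.Pair and a static import of org.junit.Assert.assertEquals):

    @Test
    public void mapperOutputCanBeInspectedByHand() throws IOException {
        // run() executes the mapper and hands back the output pairs,
        // rather than comparing them against a declared expected output
        List<Pair<Text, IntWritable>> output =
                mapDriver.withInput(new LongWritable(), new Text("the,quick"))
                        .run();

        assertEquals(2, output.size());
        assertEquals(new Text("the"), output.get(0).getFirst());
        assertEquals(new IntWritable(1), output.get(0).getSecond());
    }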

Reducer Tests

Writing reducer tests is just as easy as writing mapper tests:

package com.sodonnel.Hadoop;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver; 
import org.junit.*;

public class WordCountReducerTest {

    ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

    @Before
    public void setup() {
        WordCountReducer reducer = new WordCountReducer();
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void reducerSumsTheCountsForAKey() throws IOException, InterruptedException {
        reduceDriver.withInput(new Text("the"), Arrays.asList(new IntWritable(1), new IntWritable(2)))
                .withOutput(new Text("the"), new IntWritable(3)) 
                .runTest();
    } 

}

This time you create a ReduceDriver object and pass it the input and expected output in a similar way.

Testing The Driver

After testing mapper and reducer in isolation, it is a good idea to have a couple of checks on the driver class that bolts the map reduce job together.

It turns out this is pretty simple, and it doesn't even require MRUnit to perform the tests.

Hadoop allows you to run map reduce jobs in a local mode, where files are read from and written to the local filesystem instead of HDFS. It is also possible to build a Configuration object in code and pass it to the driver, instead of relying on the usual xml config files.

To test the driver, you build a configuration (making sure it sets local mode), instantiate the driver class, pass the config and any command line arguments and then run the job - the following code gives an example:

@Test
public void test() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "file:///");          // use the local filesystem rather than HDFS
    conf.set("mapreduce.framework.name", "local"); // run map reduce in local, single JVM mode
    Path input = new Path("input"); 
    Path output = new Path("output");

    FileSystem fs = FileSystem.getLocal(conf); 
    fs.delete(output, true); // delete old output
    WordCount driver = new WordCount();
    driver.setConf(conf);
    int exitCode = driver.run(new String[] { input.toString(), output.toString() });
    assertThat(exitCode, is(0));
    // checkOutput(conf, output);
}

Notice that in the conf object, mapreduce.framework.name is set to local - this tells Hadoop to run the job in a local, single JVM mode.

Also notice that the run method is passed an array of strings. This is how you simulate passing command line parameters to the job - the parameters are received by the driver class just as if they were passed by the Hadoop command line program.
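For context, this works because the driver implements the Tool interface (which is where the run and setConf methods used above come from). The real class is in the previous article, but a rough sketch of the shape this test assumes looks something like:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // args[0] and args[1] arrive exactly as passed to driver.run()
        // in the test, or on the hadoop command line
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCount(), args));
    }
}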

I have commented out the call to checkOutput, as it is specific to each test - you probably want to run the job with a known input and then validate that the actual output matches what is expected.
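checkOutput is not shown as it will be different for every job, but a hypothetical version (note it takes the expected text as an extra argument here) might just read the reducer output back off the local filesystem and compare it - it needs java.io.BufferedReader and InputStreamReader plus the same assertThat/is static imports as the test above:

// Hypothetical helper - reads the single reducer output file written by
// the local job and asserts it matches the text the test expects
private void checkOutput(Configuration conf, Path output, String expected) throws IOException {
    FileSystem fs = FileSystem.getLocal(conf);
    // the local runner writes reducer output to part-r-00000, just as on HDFS
    Path outputFile = new Path(output, "part-r-00000");

    StringBuilder actual = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(outputFile)))) {
        String line;
        while ((line = reader.readLine()) != null) {
            actual.append(line).append("\n");
        }
    }
    assertThat(actual.toString(), is(expected));
}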

Note - I was not able to get this sort of test working on Windows. It worked fine on OS X and Linux.

Running The Job Against Local Files

Once you have unit tested a map reduce job, the next stage is to pass some actual data to the program and see if it works end to end.

Hadoop has a local mode that allows you to run an entire map reduce program in a single JVM, and hence without a full Hadoop cluster - you can get away with just the Hadoop Client libraries installed.

To do this, you override the job tracker address (mapred.job.tracker in Map Reduce V1) or mapreduce.framework.name (YARN), setting it to the special value of local, which is actually the default.

You can also override the default filesystem, telling it to use the local filesystem instead of HDFS.

Place the following in a file called config/hadoop-local.xml:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>
  </property>
</configuration>

Then create a directory called input and put a file containing some CSV data into it. There is no need to create the output directory - the job creates it when it runs, and will in fact fail if it already exists.

This will give you a directory structure that looks like:

MapReduce-0.0.1-SNAPSHOT.jar
input
  data.csv
config
  hadoop-local.xml

Now you can run the local job using the following command:

$ hadoop --config config jar MapReduce-0.0.1-SNAPSHOT.jar com.sodonnel.MapReduce.WordCount input output

The job should run very quickly (assuming your csv file is small) compared to running the job on a real cluster, and the output will be written into the output directory.

Apparently you can use this local Hadoop mode to run the job inside an IDE and debug it etc, but I have yet to get that working.
