23 September 2015

Loading Stack Exchange Data Dumps to Hadoop and Hive

These days I am learning a lot about Hadoop, and as part of that I need some data to play with. A lot of Hadoop examples run against Twitter or airline data, but I decided it would be fun to look at the Stack Exchange data dumps instead. The people at Stack Exchange kindly supply data dumps in XML format for the various Stack Exchange sites.

The Stack Overflow dump is quite large, so I have been experimenting with the Serverfault dump instead, which is about 350MB compressed.

As I write this, the dump file contains the following files:

  • Badges.xml
  • Comments.xml
  • PostHistory.xml
  • PostLinks.xml
  • Posts.xml
  • Tags.xml
  • Users.xml
  • Votes.xml

I guess the file format can change from time to time, but the overall definition seems fairly consistent with this post. The main differences are that some of the tables mentioned in that post are absent from the dump, and the static lookup tables are missing, but those can easily be recreated.
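For example, one of the missing lookup tables could be recreated in Hive like this. This is a hypothetical sketch: the table name and the id-to-name pairs follow the commonly published schema description, so verify them against the schema post, and note that INSERT ... VALUES needs Hive 0.14 or later.

```shell
# Hypothetical sketch: recreate one of the missing static lookup tables.
# The rows shown follow the commonly published schema description.
hive -e "
CREATE TABLE votetypes (id INT, name STRING);
INSERT INTO TABLE votetypes VALUES
  (1, 'AcceptedByOriginator'),
  (2, 'UpMod'),
  (3, 'DownMod');
"
```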

Hive can deal with XML files, and the XML in the data dumps is pretty simple: each row is wrapped in a single tag, with the column values encoded as attributes, for example:

<row Id="1" PostId="1" VoteTypeId="2" CreationDate="2009-04-30T00:00:00.000" />
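As a quick sanity check on that format, a single attribute can be pulled out with sed. The row below is just the example above; the MapReduce job described later does the real parsing.

```shell
# Pull the Id attribute out of a dump row with sed. The leading space in
# the pattern stops it matching PostId or VoteTypeId instead.
row='<row Id="1" PostId="1" VoteTypeId="2" CreationDate="2009-04-30T00:00:00.000" />'
echo "$row" | sed -n 's/.* Id="\([^"]*\)".*/\1/p'
# prints: 1
```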

In an effort to learn a bit about Avro, and make things hard for myself, I decided to convert all the XML files into Avro files and then load them into Hive.

The easiest way to do this is to create a Hive table pointing at each XML file and then create a new Avro table using a CREATE TABLE AS SELECT statement. I decided it would be more educational to write a MapReduce job to do this instead, making things even more difficult for myself.
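For reference, that simpler route would look roughly like this. This is a sketch, not the approach taken in this post: the input directory and the two extracted columns are illustrative assumptions, and a real table would parse every attribute.

```shell
# Sketch of the simpler route: expose the raw XML as a one-column text
# table, parse attributes with regexp_extract, and write the result out
# as Avro with CREATE TABLE AS SELECT. Paths and columns are illustrative.
hive -e "
CREATE EXTERNAL TABLE votes_xml (line STRING)
LOCATION '/user/vagrant/input_votes';

CREATE TABLE votes_avro STORED AS AVRO AS
SELECT regexp_extract(line, ' Id=\"([0-9]+)\"', 1)     AS id,
       regexp_extract(line, ' PostId=\"([0-9]+)\"', 1) AS postid
FROM votes_xml
WHERE line LIKE '%<row%';
"
```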

The Short Version

  1. Download a data dump, decompress and put into the input folder in your hdfs home directory. Serverfault is a manageable size.
  2. Clone the git repo
  3. mvn package -Dmaven.test.skip=true. A jar file will be created inside the target directory (WordCount-1.0-SNAPSHOT.jar)
  4. Download avro-1.7.7.jar and avro-mapred-1.7.7-hadoop2.jar from here
  5. On your Hadoop client box, run the MapReduce job:
export LIBJARS=avro-1.7.7.jar,avro-mapred-1.7.7-hadoop2.jar
export HADOOP_CLASSPATH=avro-1.7.7.jar:avro-mapred-1.7.7-hadoop2.jar
hadoop jar WordCount-1.0-SNAPSHOT.jar com.sodonnel.stackOverflow.xmlToAvro -libjars $LIBJARS input output
  6. Create the Hive structures by running data_set/stackoverflow/create_directories.sh in the git repo

For more explanation, read on ...

Load Raw Data To Hadoop

The Stack Exchange dumps come compressed in 7z format (which I had never used before). The first step is to decompress them and load them into Hadoop. I put them into a folder called input in my HDFS home directory:
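The decompression step can be done with p7zip (an assumption on my part that the 7za binary is installed; the archive name is illustrative):

```shell
# Extract the 7z archive into a directory of the same name (assumes the
# p7zip package, which provides the 7za command, is installed).
7za x serverfault.com.7z -oserverfault.com
```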

[vagrant@standalone serverfault.com]$ ls -ltrh
total 2.0G
-rw-r--r-- 1 vagrant vagrant 168M Sep 23 15:23 Comments.xml
-rw-r--r-- 1 vagrant vagrant  39M Sep 23 15:23 Badges.xml
-rw-r--r-- 1 vagrant vagrant 2.9M Sep 23 15:23 PostLinks.xml
-rw-r--r-- 1 vagrant vagrant 916M Sep 23 15:23 PostHistory.xml
-rw-r--r-- 1 vagrant vagrant 139M Sep 23 15:23 Votes.xml
-rw-r--r-- 1 vagrant vagrant  64M Sep 23 15:23 Users.xml
-rw-r--r-- 1 vagrant vagrant 224K Sep 23 15:23 Tags.xml
-rw-r--r-- 1 vagrant vagrant 625M Sep 23 15:23 Posts.xml

[vagrant@standalone serverfault.com]$ hadoop fs -put *.xml input/

[vagrant@standalone serverfault.com]$ hadoop fs -ls input
Found 8 items
-rw-r--r--   3 vagrant vagrant   40661120 2015-09-23 15:24 input/Badges.xml
-rw-r--r--   3 vagrant vagrant  176150364 2015-09-23 15:24 input/Comments.xml
-rw-r--r--   3 vagrant vagrant  959680230 2015-09-23 15:24 input/PostHistory.xml
-rw-r--r--   3 vagrant vagrant    2988058 2015-09-23 15:24 input/PostLinks.xml
-rw-r--r--   3 vagrant vagrant  654708460 2015-09-23 15:24 input/Posts.xml
-rw-r--r--   3 vagrant vagrant     228933 2015-09-23 15:24 input/Tags.xml
-rw-r--r--   3 vagrant vagrant   66168226 2015-09-23 15:24 input/Users.xml
-rw-r--r--   3 vagrant vagrant  144769493 2015-09-23 15:25 input/Votes.xml

Convert XML to Avro

The next step is to run a map-only MapReduce job to convert the XML files to Avro. I have documented that process previously.

After running that job, there should be a bunch of Avro files in the output directory:

[vagrant@standalone]$ hadoop fs -ls output
Found 33 items
-rw-r--r--   3 vagrant vagrant          0 2015-09-23 18:28 output/_SUCCESS
-rw-r--r--   3 vagrant vagrant   22478186 2015-09-23 18:28 output/badges-m-00016.avro
-rw-r--r--   3 vagrant vagrant  111145713 2015-09-23 18:21 output/comments-m-00001.avro
-rw-r--r--   3 vagrant vagrant   35046851 2015-09-23 18:28 output/comments-m-00015.avro
-rw-r--r--   3 vagrant vagrant    1330504 2015-09-23 18:28 output/postlinks-m-00018.avro
-rw-r--r--   3 vagrant vagrant  141311747 2015-09-23 18:25 output/posts-m-00009.avro
-rw-r--r--   3 vagrant vagrant  140294479 2015-09-23 18:25 output/posts-m-00010.avro
-rw-r--r--   3 vagrant vagrant  139147884 2015-09-23 18:26 output/posts-m-00011.avro
-rw-r--r--   3 vagrant vagrant  138334514 2015-09-23 18:27 output/posts-m-00012.avro
-rw-r--r--   3 vagrant vagrant  120587513 2015-09-23 18:27 output/posts-m-00013.avro
-rw-r--r--   3 vagrant vagrant     119501 2015-09-23 18:28 output/tags-m-00019.avro
-rw-r--r--   3 vagrant vagrant   78929530 2015-09-23 18:28 output/users-m-00014.avro
-rw-r--r--   3 vagrant vagrant   83896117 2015-09-23 18:20 output/votes-m-00000.avro

Create Hive Directory Structure

A Hive table points at a directory, and we want a table for each of the original Stack Exchange files, so the next step is to create a simple directory structure and move the files from the output directory into the correct place. The following bash script should do the trick:


# Matches the location used in the CREATE TABLE statements below
BASE_DIR=/user/vagrant/hive/stackoverflow

dirs=("users" "posts" "comments" "tags" "votes" "badges" "postlinks")
for i in "${dirs[@]}"; do
    hadoop fs -mkdir -p ${BASE_DIR}/${i}
    hadoop fs -mv output/${i}*.avro ${BASE_DIR}/${i}/
done

Create Hive Tables

Now that the data is converted and moved into place, all that is left is to create some Hive tables. Current Hive versions (>= 0.14 I think) make creating tables over Avro files very simple - you don't even need to specify the Avro schema in the table definition, as it is derived from the columns in the create table statement. So this step is as simple as one CREATE TABLE statement per table, eg:

CREATE TABLE users (
  id string,
  reputation string,
  creationdate string,
  displayname string,
  lastaccessdate string,
  websiteurl string,
  location string,
  aboutme string,
  views string,
  upvotes string,
  downvotes string,
  age string,
  accountid string,
  profileimageurl string
)
STORED AS AVRO
LOCATION '/user/vagrant/hive/stackoverflow/users';

Get the full script on Github.

Run Queries in Hive

Now you should be able to run queries against the data in Hive:

hive> select count(*) from posts;
Query ID = vagrant_20150923184848_60c6c182-9b95-438d-875d-0ed433b4c7ff
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1442929182717_0016, Tracking URL = http://standalone:8088/proxy/application_1442929182717_0016/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1442929182717_0016
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2015-09-23 18:48:43,080 Stage-1 map = 0%,  reduce = 0%
2015-09-23 18:48:53,565 Stage-1 map = 12%,  reduce = 0%, Cumulative CPU 6.78 sec
2015-09-23 18:48:54,599 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 8.02 sec
2015-09-23 18:49:05,033 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 14.75 sec
2015-09-23 18:49:07,105 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 16.33 sec
2015-09-23 18:49:15,493 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 21.12 sec
2015-09-23 18:49:19,695 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 22.33 sec
MapReduce Total cumulative CPU time: 22 seconds 330 msec
Ended Job = job_1442929182717_0016
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 22.33 sec   HDFS Read: 680169669 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 330 msec
Time taken: 46.837 seconds, Fetched: 1 row(s)


You may have noticed I have cheated in a major area: every column in every table is a string, including the dates. This would be much more useful if those were true Hive timestamp columns.
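A hedged sketch of one way to fix that, untested against this data set: Hive can cast these strings to TIMESTAMP once the ISO-8601 'T' separator is replaced with a space.

```shell
# Sketch: Hive's CAST to TIMESTAMP expects 'yyyy-MM-dd HH:mm:ss[.fff]',
# so swap the 'T' separator for a space first,
# e.g. '2009-04-30T00:00:00.000' -> '2009-04-30 00:00:00.000'.
hive -e "
SELECT CAST(regexp_replace(creationdate, 'T', ' ') AS TIMESTAMP)
FROM users
LIMIT 5;
"
```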
