23 September 2015
These days I am learning a lot about Hadoop, and as part of that I need some data to play with. A lot of Hadoop examples run against Twitter data, or airline data, but I decided it would be fun to look at the StackExchange data dumps instead. The people at StackExchange kindly supply data dumps in XML format for the various StackExchange sites.
The Stack Overflow dump is quite large, so I have been experimenting with the Serverfault dump instead, which is about 350MB compressed.
As I write this, the dump contains XML files for Badges, Comments, PostHistory, PostLinks, Posts, Tags, Users and Votes - the full listing is shown further down.
I guess the file format can change from time to time, but the overall definition seems fairly consistent with this post. The major difference is that some of the tables mentioned in the post are not in the dump, and the static lookup tables are missing, but they can easily be created.
Hive can deal with XML files, and the XML from the data dumps is pretty simple, with each row stored as a single self-closing row element, eg:
<row Id="1" PostId="1" VoteTypeId="2" CreationDate="2009-04-30T00:00:00.000" />
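Each row is just one element with a handful of attributes, so pulling the values out is simple. As a minimal sketch (illustrative only, not the code from the repo - the RowParser class here is a hypothetical helper), a single line can be turned into a map of attribute names to values with the standard Java DOM API:

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical helper: parse one self-closing <row ... /> line into a
// map of attribute name -> value.
public class RowParser {

    public static Map<String, String> parseRow(String xmlLine) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Element row = builder
            .parse(new InputSource(new StringReader(xmlLine)))
            .getDocumentElement();

        Map<String, String> attributes = new HashMap<String, String>();
        NamedNodeMap attrs = row.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            attributes.put(attrs.item(i).getNodeName(),
                           attrs.item(i).getNodeValue());
        }
        return attributes;
    }

    public static void main(String[] args) throws Exception {
        String line = "<row Id=\"1\" PostId=\"1\" VoteTypeId=\"2\" "
                    + "CreationDate=\"2009-04-30T00:00:00.000\" />";
        // Prints something like {VoteTypeId=2, Id=1, PostId=1, CreationDate=...}
        System.out.println(parseRow(line));
    }
}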
In an effort to learn a bit about Avro, and make things hard for myself, I decided to convert all the XML files into avro files and then load them to Hive.
The easiest way to do this is to create a Hive table pointing at each XML file and then create a new Avro table using a CREATE TABLE AS SELECT statement. I decided it would be more educational to write a MapReduce job to do the conversion instead, making things even more difficult for myself.
The short version is as follows. Build the code (available in the git repo) with Maven:

mvn package -Dmaven.test.skip=true

A jar file will be created inside the target directory (WordCount-1.0-SNAPSHOT.jar). Then run the conversion job against the uploaded XML files:

export LIBJARS=avro-1.7.7.jar,avro-mapred-1.7.7-hadoop2.jar
export HADOOP_CLASSPATH=avro-1.7.7.jar:avro-mapred-1.7.7-hadoop2.jar
hadoop jar WordCount-1.0-SNAPSHOT.jar com.sodonnel.stackOverflow.xmlToAvro -libjars $LIBJARS input output

Finally, run data_set/stackoverflow/create_directories.sh from the git repo to move the converted files into place, then create the Hive tables.

For more explanation, read on ...
The Stack Exchange dumps come compressed in 7z format (which I had never used before). The first step is to decompress them and load them into Hadoop. I put them into a folder called input in my HDFS home directory:
[vagrant@standalone serverfault.com]$ ls -ltrh
total 2.0G
-rw-r--r-- 1 vagrant vagrant 168M Sep 23 15:23 Comments.xml
-rw-r--r-- 1 vagrant vagrant 39M Sep 23 15:23 Badges.xml
-rw-r--r-- 1 vagrant vagrant 2.9M Sep 23 15:23 PostLinks.xml
-rw-r--r-- 1 vagrant vagrant 916M Sep 23 15:23 PostHistory.xml
-rw-r--r-- 1 vagrant vagrant 139M Sep 23 15:23 Votes.xml
-rw-r--r-- 1 vagrant vagrant 64M Sep 23 15:23 Users.xml
-rw-r--r-- 1 vagrant vagrant 224K Sep 23 15:23 Tags.xml
-rw-r--r-- 1 vagrant vagrant 625M Sep 23 15:23 Posts.xml
[vagrant@standalone serverfault.com]$ hadoop fs -put *.xml input/
[vagrant@standalone serverfault.com]$ hadoop fs -ls input
Found 8 items
-rw-r--r-- 3 vagrant vagrant 40661120 2015-09-23 15:24 input/Badges.xml
-rw-r--r-- 3 vagrant vagrant 176150364 2015-09-23 15:24 input/Comments.xml
-rw-r--r-- 3 vagrant vagrant 959680230 2015-09-23 15:24 input/PostHistory.xml
-rw-r--r-- 3 vagrant vagrant 2988058 2015-09-23 15:24 input/PostLinks.xml
-rw-r--r-- 3 vagrant vagrant 654708460 2015-09-23 15:24 input/Posts.xml
-rw-r--r-- 3 vagrant vagrant 228933 2015-09-23 15:24 input/Tags.xml
-rw-r--r-- 3 vagrant vagrant 66168226 2015-09-23 15:24 input/Users.xml
-rw-r--r-- 3 vagrant vagrant 144769493 2015-09-23 15:25 input/Votes.xml
The next step is to run a map-only MapReduce job to convert the XML files to Avro. I have documented that process previously.
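For reference, here is a rough sketch of the shape such a map-only job can take, using the avro-mapred 1.7.7 classes listed in the commands above. It is not the code from the repo - the real xmlToAvro job covers every dump file, whereas this illustrative class handles a cut-down Votes table only, with a crude attribute extractor instead of a proper XML parser:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Illustrative map-only job for a single table (votes). Every field is a
// string, mirroring the Hive tables created later in this post.
public class VotesXmlToAvro extends Configured implements Tool {

    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"vote\",\"fields\":["
      + "{\"name\":\"id\",\"type\":[\"null\",\"string\"]},"
      + "{\"name\":\"postid\",\"type\":[\"null\",\"string\"]},"
      + "{\"name\":\"votetypeid\",\"type\":[\"null\",\"string\"]},"
      + "{\"name\":\"creationdate\",\"type\":[\"null\",\"string\"]}]}");

    public static class XmlRowMapper
            extends Mapper<LongWritable, Text, AvroKey<GenericRecord>, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String xml = line.toString().trim();
            if (!xml.startsWith("<row")) {
                return; // skip the XML declaration and the wrapper element
            }
            GenericRecord record = new GenericData.Record(SCHEMA);
            record.put("id", attr(xml, "Id"));
            record.put("postid", attr(xml, "PostId"));
            record.put("votetypeid", attr(xml, "VoteTypeId"));
            record.put("creationdate", attr(xml, "CreationDate"));
            context.write(new AvroKey<GenericRecord>(record), NullWritable.get());
        }

        // Crude attribute extraction - good enough for a sketch, but a real
        // job should use an XML parser.
        private String attr(String xml, String name) {
            int start = xml.indexOf(name + "=\"");
            if (start < 0) {
                return null;
            }
            start += name.length() + 2;
            return xml.substring(start, xml.indexOf('"', start));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "xmlToAvro");
        job.setJarByClass(VotesXmlToAvro.class);
        job.setMapperClass(XmlRowMapper.class);
        job.setNumReduceTasks(0); // map-only
        AvroJob.setOutputKeySchema(job, SCHEMA);
        job.setOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new VotesXmlToAvro(), args));
    }
}

Going through ToolRunner is what lets the -libjars flag in the earlier hadoop jar command ship the Avro jars along with the job.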
After running that job, there should be a bunch of avro files in the output directory:
[vagrant@standalone]$ hadoop fs -ls output
Found 33 items
-rw-r--r-- 3 vagrant vagrant 0 2015-09-23 18:28 output/_SUCCESS
-rw-r--r-- 3 vagrant vagrant 22478186 2015-09-23 18:28 output/badges-m-00016.avro
-rw-r--r-- 3 vagrant vagrant 111145713 2015-09-23 18:21 output/comments-m-00001.avro
-rw-r--r-- 3 vagrant vagrant 35046851 2015-09-23 18:28 output/comments-m-00015.avro
-rw-r--r-- 3 vagrant vagrant 1330504 2015-09-23 18:28 output/postlinks-m-00018.avro
-rw-r--r-- 3 vagrant vagrant 141311747 2015-09-23 18:25 output/posts-m-00009.avro
-rw-r--r-- 3 vagrant vagrant 140294479 2015-09-23 18:25 output/posts-m-00010.avro
-rw-r--r-- 3 vagrant vagrant 139147884 2015-09-23 18:26 output/posts-m-00011.avro
-rw-r--r-- 3 vagrant vagrant 138334514 2015-09-23 18:27 output/posts-m-00012.avro
-rw-r--r-- 3 vagrant vagrant 120587513 2015-09-23 18:27 output/posts-m-00013.avro
-rw-r--r-- 3 vagrant vagrant 119501 2015-09-23 18:28 output/tags-m-00019.avro
-rw-r--r-- 3 vagrant vagrant 78929530 2015-09-23 18:28 output/users-m-00014.avro
-rw-r--r-- 3 vagrant vagrant 83896117 2015-09-23 18:20 output/votes-m-00000.avro
A Hive table points at a directory, and we want a table for each of the original Stack Exchange files, so the next step is to create a simple directory structure and move the files from the output directory into the correct place. The following bash script should do the trick:
BASE_DIR=/user/vagrant/hive/stackoverflow
dirs=("users" "posts" "comments" "tags" "votes" "badges" "postlinks")
for i in "${dirs[@]}"
do
hadoop fs -mkdir -p ${BASE_DIR}/${i}
hadoop fs -mv output/${i}*.avro ${BASE_DIR}/${i}/
done
Now that the data is converted and moved into place, all that is left is to create some Hive tables. Current Hive versions (>= 0.14 I think) make creating tables over Avro files very simple - you don't even need to specify the Avro schema in the table definition, as it is derived from the columns in the CREATE TABLE statement. So this step is as simple as one CREATE TABLE statement per file, eg:
CREATE EXTERNAL TABLE users(
id string,
reputation string,
creationdate string,
displayname string,
lastaccessdate string,
websiteurl string,
location string,
aboutme string,
views string,
upvotes string,
downvotes string,
age string,
accountid string,
profileimageurl string
)
STORED AS AVRO location '/user/vagrant/hive/stackoverflow/users';
Get the full script on Github.
Now you should be able to run queries against the data in Hive:
hive> select count(*) from posts;
Query ID = vagrant_20150923184848_60c6c182-9b95-438d-875d-0ed433b4c7ff
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1442929182717_0016, Tracking URL = http://standalone:8088/proxy/application_1442929182717_0016/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1442929182717_0016
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2015-09-23 18:48:43,080 Stage-1 map = 0%, reduce = 0%
2015-09-23 18:48:53,565 Stage-1 map = 12%, reduce = 0%, Cumulative CPU 6.78 sec
2015-09-23 18:48:54,599 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 8.02 sec
2015-09-23 18:49:05,033 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 14.75 sec
2015-09-23 18:49:07,105 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 16.33 sec
2015-09-23 18:49:15,493 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 21.12 sec
2015-09-23 18:49:19,695 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 22.33 sec
MapReduce Total cumulative CPU time: 22 seconds 330 msec
Ended Job = job_1442929182717_0016
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 22.33 sec HDFS Read: 680169669 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 330 msec
OK
537109
Time taken: 46.837 seconds, Fetched: 1 row(s)
You may have noticed I have cheated in one major area - every column in every table is a string, including the dates. This would be much more useful if those were true Hive timestamp columns.