21 April 2015

Comparing Sequence Files, ORC Files and Parquet Files

Back when I started working with Hadoop, I ran some benchmarks on different file types, mainly looking at how well they compressed the data and whether or not they were splittable. I quickly learned that just loading files as gzipped text was a bad idea, since gzipped text is not a splittable format. Eventually we settled on gzip-compressed Sequence files in our project, which was probably not the optimal choice.

Since then, both Parquet and ORC files have been getting a lot of press, and I thought it was about time I had a good look at them.

Test Platform and Plan

I wanted to do some basic checks on each of the file types using real-world data from our application. I did not make any effort to change any of the default settings, except to set PARQUET_COMPRESSION_CODEC=snappy (on my system it seemed to default to NONE). My main areas of interest are how big the resulting files become, and how much CPU is consumed creating them and later querying them.
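For reference, that was the only non-default setting, issued as a SET statement in the shell before loading the Parquet tables:

SET PARQUET_COMPRESSION_CODEC=snappy;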

I ran these tests on Cloudera Hadoop (CDH) 5.2.1 with Hive 0.13. One thing to note is that in this version Parquet does not support the Timestamp data type, which will hurt its compression statistics; all of my test tables have at least one Timestamp column. Hopefully I can re-run these tests once my cluster is upgraded.

Log Table

The first table I looked at holds application log data. Typically a day of data is about 80GB, stored as a gzipped compressed Sequence file.

I created one day of data using both ORC and PARQUET:

SEQUENCE FILE: 80.9 G created in 1344 seconds, 68611 CPU seconds
ORC FILE     : 33.9 G created in 1710 seconds, 82051 CPU seconds
PARQUET FILE : 49.3 G created in 1421 seconds, 86263 CPU seconds

Both ORC and Parquet compress much better than Sequence files, with ORC the clear winner; however, ORC does take slightly more CPU to create. It is interesting, though not really surprising, that creating Sequence files is much cheaper than either of the other two formats.
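I have not included the exact statements, but a minimal sketch of how such copies can be built in Hive 0.13 looks like this (log_data is an invented table name; the real tables are partitioned by day and have many more columns):

-- sketch: build ORC and Parquet copies of an existing table
CREATE TABLE log_data_orc
  STORED AS ORC
  AS SELECT * FROM log_data;

CREATE TABLE log_data_parquet
  STORED AS PARQUET
  AS SELECT * FROM log_data;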

The next thing to be concerned with is query performance. The extra overhead of creating the files is easy to accept if queries benefit over and over again. Even ignoring the special features built into ORC and Parquet, I expect both to do much better than Sequence files simply because of the large difference in file sizes.

Simple count(*)

SEQUENCE FILE:  202 seconds; 9316 CPU (second run 242 seconds)
ORC FILE     :  148 seconds; 1839 CPU (second run 122 seconds)
PARQUET FILE :  139 seconds; 2801 CPU (second run 117 seconds)
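The query itself is just a full count over the day of data, along the lines of the following (using the invented table name from the sketch above):

SELECT COUNT(*) FROM log_data_orc;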

Filter on 1 column and group by

SEQUENCE FILE:  340 seconds; 14318 CPU (second run 373 seconds)
ORC FILE     :  165 seconds; 2978  CPU (second run 157 seconds)
PARQUET FILE :  165 seconds; 4490  CPU (second run 171 seconds )
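I have not reproduced the real query, but it has this general shape: a filter on one column and a group by on another (column names here are invented):

-- shape of the second test query
SELECT server, COUNT(*) AS hits
FROM log_data_orc
WHERE status = 'ERROR'
GROUP BY server;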

Filter on 3 columns plus a lookup join

SEQUENCE FILE:  526 seconds; CPU 14031 (second run 491 seconds)
ORC FILE     :  201 seconds; CPU 5329  (second run 204 seconds)
PARQUET FILE :  240 seconds; CPU 8797  (second run 312 seconds)
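Again, this is only the shape of the query, not the real one: three filter columns on the log table plus a join to a small lookup table (all names invented):

-- shape of the third test query
SELECT lk.category, COUNT(*) AS hits
FROM log_data_orc l
JOIN category_lookup lk ON l.category_id = lk.category_id
WHERE l.status = 'ERROR'
  AND l.server = 'web01'
  AND l.response_code = 500
GROUP BY lk.category;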

In terms of CPU, ORC is the clear winner in all these tests, and it just about edges ahead on response time too. As runtimes can be quite variable on a Hadoop cluster, I pay more attention to CPU used as the performance benchmark.

Wide Transaction Table with Array of Structs

Another important table in our application is a very wide table that also makes use of an array of structs embedded in each row. This table has many fewer rows than the log table, coming in at about 1.5GB a day:

SEQUENCE FILE: 1.5 G
ORC FILE     : 835.9 M created in 414 seconds; 1705 CPU seconds
PARQUET FILE : 919.3 M created in 290 seconds; 1510 CPU seconds
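The nested column is what makes this table interesting. In Hive DDL terms it looks roughly like this (simplified, with invented column and field names):

-- sketch: a wide table carrying an array of structs per row
CREATE TABLE transactions_orc (
  txn_id    BIGINT,
  txn_time  TIMESTAMP,
  -- each row carries a variable-length list of line items
  items     ARRAY<STRUCT<sku:STRING, quantity:INT, price:DOUBLE>>
)
STORED AS ORC;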

Select count(*)

SEQUENCE FILE: 76 seconds;  109 CPU (second run 68 seconds)
ORC FILE     : 70 seconds;  27  CPU (second run 73 seconds)
PARQUET FILE : 69 seconds;  42  CPU (second run 58 seconds)

Expand lateral view, filter and count

SEQUENCE FILE: 98 seconds; 240 CPU (second run 80 seconds)
ORC FILE     : 85 seconds; 114 CPU (second run 77 seconds)
PARQUET FILE : 84 seconds; 196 CPU (second run 94 seconds)
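The lateral view query explodes the array so each struct becomes a row of its own, then filters and counts, roughly like this (again using the invented names from the sketch above):

-- shape of the lateral view test query
SELECT COUNT(*)
FROM transactions_orc t
LATERAL VIEW explode(t.items) item_table AS item
WHERE item.quantity > 1;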

Again, in both these tests ORC seems to be the winner for queries, but it is the most costly file to create.

Transaction Table Copied From RDBMS

This is a fairly typical database table, storing about 1.1GB compressed each day:

SEQUENCE FILE: 1.1 G
ORC FILE     : 667.0 M created in 202 seconds; 989 CPU seconds
PARQUET FILE : 691.0 M created in 202 seconds; 853 CPU seconds

I didn't run any queries on this table, but again ORC creates the smallest files, at the cost of the highest CPU overhead at file creation time.

Conclusion

Clearly you should not use Sequence files to store Hive tables. While they are efficient to create, the additional disk space and the CPU overhead of reading them are a heavy cost.

Based on this quick set of tests, ORC files win for me. They are more expensive to create than Parquet files, but they compress my data better and use less CPU for my test queries.

There is one more thing to consider: at this time, Impala cannot read ORC files, so it may make sense to go with Parquet if Impala is something you will need now or in the future.
