20 September 2013

Masking data in Hive

I had a problem recently where I needed to mask a bunch of sensitive production data to create a database performance test environment. By chance the data was both in Oracle and in Hadoop.

The application in question just works on strings - it doesn't really care what format the strings are in. Therefore, so long as the masking turns every occurrence of the same string into the same replacement string, the application will work perfectly.

To turn one string into another string, the obvious choice is a one-way function such as SHA1 or SHA256. On its own this is not overly secure, as someone could reverse engineer some of my sensitive data with a brute force attack against the hashes. Adding a salt to the hash would make it much more secure.
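For illustration, a salted hash of a single value in Java might look something like this (just a sketch - the class and method names are mine, and how you generate and store the salt is up to you):

import java.security.MessageDigest;

import org.apache.commons.codec.binary.Hex;

public class SaltedHash
{
  public static String hash(byte[] salt, String value) throws java.security.NoSuchAlgorithmException {

    // prepend the salt to the value and hash the combination with SHA256
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    digest.update(salt);
    digest.update(value.getBytes());

    return Hex.encodeHexString(digest.digest());
  }
}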

Then I recalled something I had heard about on Security Now some time ago called HMAC. It combines a secret key (which acts much like a salt) with the data and applies the hash function twice.
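Roughly, HMAC(key, message) = H((key' xor opad) || H((key' xor ipad) || message)), where H is the underlying hash function (SHA1 in my case), key' is the key padded out to the hash's block size, and opad and ipad are fixed constants - so the key is mixed into both the inner and the outer hash.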

If I generated a random key for the HMAC function, masked all my data and then threw away the key, there should be no way for anyone to reverse engineer the original data from its hash.
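Generating a suitable random key is easy with the standard javax.crypto classes - something along these lines (a sketch only, hex encoding the key so it can be passed around as a string and then discarded after the masking run):

import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

import org.apache.commons.codec.binary.Hex;

public class GenerateKey
{
  public static void main(String[] args) throws java.security.NoSuchAlgorithmException {

    // generate a random key suitable for HmacSHA1
    KeyGenerator generator = KeyGenerator.getInstance("HmacSHA1");
    SecretKey key = generator.generateKey();

    // hex encode it so it can be passed to the masking job as a string,
    // then thrown away once the run is complete
    System.out.println(Hex.encodeHexString(key.getEncoded()));
  }
}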

Oracle doesn't have a built-in HMAC function, but I did a test on some data using the built-in SHA1 hashing in the dbms_crypto package. It used a lot of CPU and took quite a long time to do the hashing - not a great thing to do on your production database.

Then I tried Hive - it doesn't have a built-in HMAC UDF either, but building on my last post, it was pretty easy to create a UDF to HMAC some data:

package com.sodonnel.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

import org.apache.commons.codec.binary.Hex;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;


public class Hmac extends UDF
{
  public String evaluate(String key, String message)
    throws java.security.InvalidKeyException, java.security.NoSuchAlgorithmException {

    // if a null or empty string is input, return empty string
    if ((null == message) || (message.isEmpty())) {
      return "";
    }

    // build an HmacSHA1 key from the supplied key string
    SecretKeySpec keySpec = new SecretKeySpec(key.getBytes(), "HmacSHA1");

    // compute the HmacSHA1 of the message using the key
    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(keySpec);
    byte[] rawHmac = mac.doFinal(message.getBytes());

    return Hex.encodeHexString(rawHmac);

  }
}

For this to compile you need to have the Apache commons-codec jar on the CLASSPATH. In the Cloudera install I am using, it is at:

/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/lib/hadoop/lib/commons-codec-1.4.jar
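Before adding the jar to Hive, a quick way to sanity check the UDF is to call evaluate directly from a small throwaway main method (my own test class, not part of the UDF):

package com.sodonnel.udf;

public class HmacTest
{
  public static void main(String[] args) throws Exception {

    Hmac hmac = new Hmac();

    // the same key and message should always produce the same masked value
    System.out.println(hmac.evaluate("my-secret-key", "some sensitive value"));
    System.out.println(hmac.evaluate("my-secret-key", "some sensitive value"));

    // a null or empty message comes back as an empty string
    System.out.println(hmac.evaluate("my-secret-key", ""));
  }
}

Once it is working, the jar can be added to Hive and registered as a temporary function in the same way as the UDF in my last post.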

Using this, I was able to hash about 30M rows, with 7 hashes per row, in about 30 seconds, which is not too shabby at all.
