18 January 2013

Calculating Velocity Scores at the Speed of Redis

To really get a feel for a technology, you have to at least build a proof of concept application using it. Building a simple application that uses Redis as an in-memory hash table isn't very interesting, so I was searching for something that would test some of the more advanced features of Redis.

Velocity Check

A problem that came to mind is implementing a Velocity Check that could be used as part of a fraud checking system. According to the first result I got on Google, the definition of a Velocity Check is:

The intent of velocity of use is to look for suspicious behaviour based on the number of associated transactions a consumer is attempting. It works based on counting the number of uses of a data element within a predetermined time frame. The theory is the higher the number of uses on a data element (e.g., credit cards) in a predefined time period (e.g., 24 hours), the higher the risk of taking an order.

So if we consider customers ordering goods from a company, there are various 'data items' we might like to track - delivery address, email and credit card number. If the same value keeps occurring over and over in an unrealistic time frame, it may suggest fraud.

Requirements

From the description above, and a bit more reading around the linked web page, the minimum set of requirements for a multi-tenant Velocity scoring engine are:

  • Each tenant can track many values
  • Each value should be tracked for a defined time period
  • To calculate a score, a rule and data value will be supplied, and the Velocity Engine must count how many times the value occurred in the defined time period.

There are obviously more requirements to think about around on-boarding accounts to use the system, configuring the rules and so on, but I don't want to think about that yet. The goal of this post is to explore the core scoring engine, assuming we have a way to configure the rule definitions.

Time To Live

As this is a multi-tenant system, each user may configure different time periods - one may want to track things over a 60 minute period, while another may want to track over a 6 month period. If we were to implement this in a relational database, we would need to think about how to delete old data to keep the size of the database in check, probably by looking up each rule and its retention, and then deleting any data held for the rule that is past the retention time.

With Redis, we can set a time to live (TTL) on a key. If we set the time to live on each data item as we store it, it will automatically be removed from the database when that time has passed, so that is something to keep in mind.
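
As a quick illustration, here is a minimal sketch using the redis gem (the key name and retention time are made up for the example):

require 'redis'

redis = Redis.new

# Store a key, then tell Redis to delete it automatically after one hour
redis.set "1546:joeblogs@nowhere.com", "anything"
redis.expire "1546:joeblogs@nowhere.com", 3600

# TTL reports how many seconds remain before the key disappears
redis.ttl "1546:joeblogs@nowhere.com"   # => 3600, counting down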

Many Distinct Values

If you consider transactions that are not fraudulent, for every field tracked there will be a large number of distinct values, and only a few values that actually repeat. The goal of the velocity scoring engine is to find those repeating values. We also have to keep the data retention requirement in mind, and set the TTL appropriately.

Modelling In Redis

The following are the pieces of data we need to track:

  • For each rule, we will have many (possibly millions) of data values, only a few of which will repeat.
  • Each data value will have an associated transaction date, and there is a retention period defined by the rule
  • A Velocity score for a rule is calculated by counting how many times a given value has been seen within the time window

After thinking about this for a while, I decided that Redis Sorted Sets would be perfect for modelling this problem. We are going to need 1 sorted set per rule, and 1 sorted set per data item.

Sorted Sets

A sorted set is like a list, except each member can appear in it only once. A member is simply a string, so a sorted set is a list of unique strings.

Associated with each member is a numeric score, which controls where the member sits in the set: Redis keeps members ordered from smallest score to largest, and you can read them back in either direction (ZRANGE for ascending, ZREVRANGE for descending). Members with the same score are sorted lexicographically.
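
A quick sketch of that behaviour using the redis gem (the key and members are made up):

require 'redis'

redis = Redis.new

# Scores here are just example numbers - in our case they will be timestamps
redis.zadd "example:set", 100, "first"
redis.zadd "example:set", 300, "third"
redis.zadd "example:set", 200, "second"
redis.zadd "example:set", 200, "second"   # re-adding a member does not duplicate it

redis.zrange "example:set", 0, -1, :with_scores => true
# => [["first", 100.0], ["second", 200.0], ["third", 300.0]]

redis.zrevrange "example:set", 0, -1
# => ["third", "second", "first"]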

We can create a sorted set for each data item, where the key is "ruleid:datavalue", each member is a transaction_id, and each member's score is the transaction date in unix timestamp format. That will give us a series of keys that look like:

1546:joeblogs@nowhere.com => [trans_id_234(Timestamp)]
1546:james@nowhere.com    => [trans_id_235(Timestamp)]
1546:cheeky@nowhere.com   => [trans_id_237(Timestamp) trans_id_236(Timestamp) trans_id_001(Timestamp)]

Notice how the third set has 3 transactions in it, suggesting there may be something strange about that email address.

Purging The Data

There are two things to think about around purging data. In the majority of cases, when a data value is seen, it will not be seen again within the retention period for a rule. So if the TTL is set on the sorted set when it is created, it will be automatically removed when necessary.

There will be some data values that repeat, possibly a lot - in this case, the TTL should be reset each time a data value is added. To stop the list getting too long, we need to figure out how to remove elements from the list that have gotten too old.

Redis has a command to purge elements from a sorted set whose scores fall within a range: ZREMRANGEBYSCORE. As the scores here are timestamps, it is pretty easy to purge members older than the retention time.
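
For example (a sketch, assuming a one hour retention; the key is one from the example above):

require 'redis'

redis = Redis.new
retention_time = 3600   # one hour retention, as an example

# Remove every member whose score (a transaction timestamp) is older than
# the retention window; "-inf" means "from the lowest possible score"
redis.zremrangebyscore "1546:cheeky@nowhere.com", "-inf", Time.now.to_i - retention_time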

Indexing the Data Values

It may be useful to have a way of viewing the latest data items added to a rule; for this we can use another sorted set. This one would have a key of the rule_id, and would contain each data value as a member, with the date of the transaction in unix timestamp format as the score. That makes it easy to read the newest data values back first using ZREVRANGE. We can also purge the old values out of this set in the same way as the other data sets. This set will potentially contain millions of members, but it is not strictly necessary to make the application work.
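
As a sketch, pulling the ten most recent data values for rule 1546 out of that index would look something like this:

require 'redis'

redis = Redis.new

# The rule-level index is keyed by the rule_id, with data values as members
# and transaction timestamps as scores, so ZREVRANGE returns newest first
redis.zrevrange "1546", 0, 9, :with_scores => true
# => e.g. [["cheeky@nowhere.com", 1358500000.0], ["james@nowhere.com", 1358499990.0], ...]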

Show me some code already!

Once you figure out how to model the problem in Redis, the code doesn't really come to much, which is pretty nice. The method below can be used to add a new record to the database. Obviously this could be coded in a much better OO style, but putting it all into a single method like this makes it easier to see what is going on.

def load_data_value(rule_id, retention_time, data_value, transaction_id, transaction_date)
  data_key = "#{rule_id}:#{data_value}"

  @redis.multi do
    # Create a sorted set for each value attached to a rule
    # The key is rule_id:rule_value - this is what we will lookup
    # to check the velocity, summing up the last N records, determined
    # by the score. 
    @redis.zadd data_key, transaction_date.to_i, transaction_id

    # Purge any members from the set that are older than the retention time
    @redis.zremrangebyscore data_key, "-inf", Time.now.to_i - retention_time

    # Each time an element is added to the set, set the TTL on it to be
    # the retention time for the set
    @redis.expire data_key, retention_time

    # Optionally, create a second sorted set to index all the data values,
    # and trim the number of items in it
    @redis.zadd rule_id, transaction_date.to_i, data_value
    @redis.zremrangebyscore rule_id, "-inf", Time.now.to_i - retention_time
  end
end

If the records are loaded in the format above, then calculating a score immediately after adding an item is pretty simple too:

def calculate_score(rule_id, data_value)
  @redis.zcard "#{rule_id}:#{data_value}"
end

This makes sense, as generally you will want to add a value and calculate its score at the same time. The Redis ZCARD command simply counts the number of members in the set with the given key, or returns zero if the key doesn't exist. This code is very much proof-of-concept code. To calculate a real Velocity Score, you will need to check more than one rule and sum them all up, but my goal here is to show how Redis can be used to solve the problem, so I am keeping it simple.
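
As a rough sketch of what summing across rules might look like (the rule-to-value mapping here is hypothetical, not part of the proof of concept):

# Hypothetical example: a transaction is checked against several rules,
# e.g. one for email, one for delivery address, one for card number.
# rule_values maps rule_id => data_value for that transaction.
def calculate_total_score(rule_values)
  rule_values.reduce(0) do |total, (rule_id, data_value)|
    total + calculate_score(rule_id, data_value)
  end
end

# calculate_total_score(1546 => "cheeky@nowhere.com",
#                       1547 => "10 Some Street",
#                       1548 => "4111111111111111")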

Performance

The thing I am really interested in is how this solution performs. To test this out, I am going to assume a transaction_id of 10 bytes and a data value of 100 random characters. I will also assume there are 1000 rules with IDs between 10,000 and 11,000, all having a TTL of 3600 seconds. The following Ruby code can be used to simulate this scenario:

require 'redis'

class SimulateVelocity

  CHARS = ('a'..'z').to_a

  def initialize(host)
    @redis = Redis.new(:host => host)
  end

  def run_operation
    rule_id = random_number(1000) + 10000
    retention_time   = 3600
    transaction_id   = random_string(10)
    data_value       = random_string(100)
    transaction_date = Time.now

    load_data_value(rule_id,
                    retention_time,
                    data_value,
                    transaction_id,
                    transaction_date)

    calculate_score(rule_id,
                    data_value)
  end

  def load_data_value(rule_id, retention_time, data_value, transaction_id, transaction_date)
    data_key = "#{rule_id}:#{data_value}"

    @redis.multi do
      # Create a sorted set for each value attached to a rule
      # The key is rule_id:rule_value - this is what we will lookup
      # to check the velocity, summing up the last N records, determined
      # by the score.
      @redis.zadd data_key, transaction_date.to_i, transaction_id

      # Purge any members from the set that are older than the retention time
      @redis.zremrangebyscore data_key, "-inf", Time.now.to_i - retention_time

      # Each time an element is added to the set, set the TTL on it to be
      # the retention time for the set
      @redis.expire data_key, retention_time

      # Optionally, create a second sorted set to index all the data values,
      # and trim the number of items in it
      @redis.zadd rule_id, transaction_date.to_i, data_value
      @redis.zremrangebyscore rule_id, "-inf", Time.now.to_i - retention_time
    end
  end

  def calculate_score(rule_id, data_value)
    @redis.zcard "#{rule_id}:#{data_value}"
  end

  private

  def random_string(length)
    str = ''
    1.upto(length) do
      str << CHARS[rand(CHARS.size)]
    end
    str
  end

  def random_number(max_value)
    rand(max_value)
  end

end

sv = SimulateVelocity.new('redishost')
while(1) do
  puts sv.run_operation
end

With this code, I can plug it into some simple benchmarking code I have and see how many scores we can calculate per second.
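
The benchmarking code itself isn't shown here, but a rough sketch of the idea (not the exact harness; the thread count and duration are arbitrary) looks like this:

# Rough sketch of a benchmark harness: run the simulation from several
# threads for a fixed time and count how many operations completed.
def benchmark(host, thread_count = 15, seconds = 10)
  finish = Time.now + seconds

  counts = thread_count.times.map do
    Thread.new do
      sv = SimulateVelocity.new(host)
      ops = 0
      while Time.now < finish
        sv.run_operation
        ops += 1
      end
      ops
    end
  end.map(&:value)

  puts "#{counts.reduce(:+) / seconds} scores per second"
end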

I get a sustained rate of about 12K - 13K scores per second across 15 threads. In this setup, each score consists of 6 Redis calls, which means Redis is running at about 72K requests per second - not too bad on a single CPU.

If I remove the optional additional index for all the data values, performance jumps to 15K - 16K per second.

If you really need to score more than 10K velocity scores per second, which is almost 1 billion per day, it would be pretty trivial to design a system where different accounts are held on different Redis instances. Until Redis Cluster comes out, this would need to be done in the application.
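
A sketch of that kind of client-side partitioning (the host names and hashing scheme are made up) might look like:

require 'redis'
require 'zlib'

# Hypothetical client-side sharding: every key for a given account lives on
# one Redis instance, so the single-instance code above works unchanged
class ShardedRedis
  HOSTS = ['redis1', 'redis2', 'redis3']

  def initialize
    @shards = HOSTS.map { |host| Redis.new(:host => host) }
  end

  def redis_for(account_id)
    @shards[Zlib.crc32(account_id.to_s) % @shards.size]
  end
end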

What About Memory?

Doing 12K - 16K scores per second is one thing, but what about memory? Remember, Redis requires the entire data set to be in memory.

Some stats I gathered on memory:

  • 100 byte data values with additional index - 490 bytes per key
  • 50 byte data values with additional index - 393 bytes per key
  • 100 byte data values, no index - 270 bytes per key
  • 50 byte data values, no index - 227 bytes per key

Storing 40M records is going to cost between 8.5GB and 18GB of memory depending on your data and the retention time for each rule (40M x 227 bytes is roughly 8.5GB, while 40M x 490 bytes is roughly 18GB), so that is something to think about before deciding if this technique is feasible or not.
