23 January 2013

One Large Redis or Many Smaller Shards?

After experimenting with a simple proof of concept Redis backed application, I turned my attention to some of the more practical aspects on attempting to run an application like it in production.

For the use case I have in mind for Redis, I will need to store a lot of data, potentially several 100GB. It is possible to get machines that have 256GB and more RAM - we use some of them today to host Oracle databases - but is it sensible to run a single Redis process that is 100GB or more in size?

Startup Time

The problem with Redis, is that the entire dataset needs to live in memory - when running it cannot read any data from disk. Due to this limitation, before Redis can accept any connections, it needs to load the entire database from disk into memory. So how long would this take?

I created a Redis instance and populated it with about 5.2GB of random data. I created a snapshot RDB file on disk, which was about 4.8GB in size. My keys and data values were randomly generated, so the RDB file did not compress much. I have heard that the RDB file for many real world applications can be about 10x smaller than the in memory database size.

Start up time for this 5.2GB database was 1min 3seconds. A guy on the Redis mailing list stated that his real world 50GB Redis instance took 20 minutes to start up, running on EC2.

I ran my test on a 24GB 3.57Gz Intel Xeon box, with the datafile stored on SAN, so maybe it outperforms an EC2 box. Either way, I am looking at about 20 - 40 minutes to start up a 100GB Redis instance.

Slaves

If you have slaves, then the start up time may be tolerable. Promote a slave to the master, and then re-point all your connections. How hard is it to create Redis slaves?

The way Redis creates slaves is as follows:

Master creates a snapshot of the entire database (an RDB file) on disk
Master transmits that file across the network to the slave, while buffering all new commands received on the master in memory for later sending to the slave.
Slaves load the RDB file. This will take about the same length of time it would take the master to start up from the RDB file.
Slave receives the stream of pending commands from the master
The slave is now online

With a very large Redis instance, it will take quite a long time to transmit the large RDB file to the slave. Then it will take the slave some time to load it. Meanwhile the master needs to keep a buffer of all the commands it received in the 20 - 40 minutes this all takes. If the master is receiving a lot of writes during this time, the buffer needed on the master may overflow, and the slave synchronization process will need to start again.

To make things worse, if a slave loses contact with the master, even for a few seconds, then it must reload the entire database from the master once again. There is no concept of an incremental refresh, although a partial resync feature is under development.

For me, bringing a new slave online is an even bigger problem than the start up time. It is going to take a long time to get a slave up and running, and if the master is very busy, it may be difficult to get the slave to sync at all.

Persistence

Redis offers two ways to ensure your data is available if you restart the process - RDB files and the Append Only File (AOF).

RDB Files

An RDB file is a consistent snapshot of entire Redis database. You can configure Redis to create a new RDB snapshot after the database receives a given number of changes, or you can kick them off at any time you wish manually. When running large Redis instances, there are a few potential problems:

The RDB file is generated by forking the Redis process, and the new process dumps its memory to disk. On very large instances, this fork may take a little time, maybe a second or slightly more in extreme cases, which blocks the entire Redis instance as it happens.
When Redis forks, it uses the standard Unix copy on write technique to mirror the parent processes memory, giving the forked process a copy of the original data without using any more memory. However, each memory page that changes in the parent process while the child process is still working results in that memory page being duplicated in the child process. If you have a large Redis instance, creating the RDB file is going to take some time. If the instance is under heavy write load while this happens, Redis will use quite a lot of extra memory until the RDB file is completed.
If you care about your data, then RDB files are really only good for consistent backups. If it takes 20 or more minutes to create a new RDB file, then at the best case, you can only secure any data that is over 20 minutes old using RDB files.

The third point here is crucial, and if you cannot afford to lose any data, then you need to look at the AOF.

Append Only File

If Redis is running in AOF mode, all operations are written to the AOF after they have completed changing the in memory data. If the AOF is set to sync to disk every second, then at most 1 second of data could be lost if the Redis instance is killed. The AOF does not significantly impact performance, so you should probably turn it on.

As every operation that changes the dataset is written to the AOF, it is going to get big quickly. If Redis needs to restart, it needs to read the AOF from the beginning, applying every change to get the database back to how it was before the shut down. This might take a very long time.

To make this start up time shorter, Redis allows the AOF to be rewritten periodically or on-demand. This works in a similar way to creating an RDB file. In simple terms, it creates an RDB file, while buffering all the writes that occurred since the process started. Then it creates a new AOF, appends the buffered writes and switches the AOFs around. Again on large instances this process is going to take a long time, making it problematic, and if you want Redis to start in a reasonable time, you probably need to rewrite the log a few times each day.

From a persistence point of view, the AOF will get the job done, and by allocating enough memory to buffer pending writes, it can be rewritten in a reasonable time. It will however make the start up time even longer.

Sharding Is Better

While none of issues I outlined here are show stoppers, they do make things difficult. My conclusion is that it makes much more sense to run many small Redis processes, probably on the same machine as a cluster. Backing up each of them is easier, starting a slave off a smaller master is easier and recreating the AOF is easier too. You also get more potential performance. One large Redis instance can only use a single CPU, while if you shard it across many instances you can use many CPUs.

One big negative, is if your application depends on the data all being in the same instance (for set intersect operations for example), you many not be able to shard, or at least not easily.

Another negative is that Redis doesn't currently offer built in clustering - it all has to be done in the application. Monitoring and running all those Redis processes is also more complex than a single instance.

That said, it can be done - the team at Craigslist documented their strategy which provides some interesting information in a real world application.

This post is part of a series:

Previous in series | Series Index | No further articles