The other side of the moon

/bb|[^b]{2}/
Never stop Grokking

Wednesday, September 30, 2009

A couple of days ago I posted about scaling writes in MySQL. I didn't say much about read performance in that post because a) it was irrelevant at the time, and b) there are thousands of articles all over the web that already cover read performance.

In this post I'm going to cover some of the things that I did to improve read performance for my own application. It may be relevant to others; you'd still need to read a whole bunch of other articles to understand MySQL read performance.

Looking at our access patterns, it turned out that there were two classes of read queries.
1. Reads to build the daily summaries
2. Reads from the summary tables in response to user actions
The former dealt with far more data in one go, but ran only once. Query time for this pattern depended on how many rows were touched: small result sets came back in milliseconds, while larger result sets (some over a million rows) took several minutes to return. The latter pattern dealt with small amounts of data, but happened far more frequently. These queries needed to return in around 10ms.

Also note from my last post that inserts handled 40,000 rows per query, which meant that each query took about 4.5 seconds to run.

Now why is all of this relevant? It's relevant because it renders your slow query log mostly useless. My slow query log grows to about 300GB fairly quickly, so you need log rotation implemented. We can, however, turn slow query logging on and off at run time using the slow_query_log global system variable. Since this variable is global, we need to worry about a few things:
1. Make sure you set it back on when your script finishes, even if the script crashes
2. Any other queries run while your slow script is running will not be logged even if they are slow
There's nothing we can do about the latter (nothing I can think of, anyway). For the former, I prefer to use a wrapper script that turns off slow query logging, then calls my slow script, and then turns it back on when the script has terminated (successfully or unsuccessfully). This ensures that my slow query log mostly contains queries that I can and should optimise. See the MySQL Slow Query Log documentation for more information on what to do and how to do it.
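The wrapper can be sketched in a few lines of Python. Here `execute` stands in for whatever database call your driver uses to run a statement (a hypothetical parameter, not a real API); the try/finally guarantees logging comes back on even if the wrapped job crashes:

```python
def run_without_slow_log(execute, job):
    """Disable the slow query log, run `job`, then re-enable logging.

    `execute` is any callable that runs a single SQL statement;
    `job` is the slow script, wrapped as a callable.
    """
    execute("SET GLOBAL slow_query_log = 'OFF'")
    try:
        return job()
    finally:
        # Runs on success *and* on any exception raised by job().
        execute("SET GLOBAL slow_query_log = 'ON'")
```

The finally block is the whole point: a crash in the slow script must not leave slow query logging disabled for everyone else.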

Now, I mentioned in my previous post that we pushed our boxes up from 4GB RAM to 16GB RAM. This left a bit of RAM free after allocating enough to the InnoDB buffer pool, and I figured we could use it to improve read performance. My first thought was to use it for the query cache. All past tests had shown that the query cache improves read performance quite a lot. Unfortunately, those tests assume that you have a single database server, or use some kind of affinity to make sure that multiple queries go to the same host.

This is not such a good idea with a multi-box set up for BCP and load balancing. There's also the way in which the query cache keys queries, which not every developer understands, and which can lead to unexpected results. I wasn't too concerned about this since I was in control of every query that went into the system, but I may not always be the one maintaining it. I decided that the best option was to turn off the query cache and enable it on a per-query basis using the SQL_CACHE directive in my queries. Instead, I use a frontend cache similar to memcached. The guys at the MySQL Performance Blog also have similar recommendations with regard to the query cache, so go read their blog for a more detailed analysis.

The second thing I did was to create tables with redundant information. I call them cache tables. I store information in them while I'm building the main tables, and that information eventually speeds up creating the summary tables. The data in there is quite simple. For example, I have a table that contains an approximate count of rows of each type that I need to summarise. That way I can schedule summarisation using the Shortest Job First algorithm. The result is that in 50% of the time, 97% of all summaries are done and most users can start using that data. Something else I haven't done yet, but may implement soon, is to run two instances of the summarising script in parallel, one on each slave, with one running the shortest jobs first while the other runs the longest jobs first, or some similar scheduling algorithm. The ideal result would be if it took 50% of the time to run.
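The Shortest Job First part is trivial once the cache table exists. A minimal sketch, assuming the approximate counts have already been read out of the cache table into a dict (the names here are illustrative, not the real schema):

```python
def schedule_sjf(approx_counts):
    """Order summary jobs so the smallest ones run first.

    approx_counts: dict mapping a job id to the approximate number
    of rows it needs to summarise (from the cache table).
    Returns the job ids, shortest job first.
    """
    return sorted(approx_counts, key=approx_counts.get)
```

Running the small jobs first is what gets 97% of summaries done in 50% of the time: the handful of million-row jobs no longer block everything queued behind them.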

The final optimisation for handling summaries was bulk queries. INSERTs, UPDATEs and SELECTs can all be batched and sometimes this can get you much better performance than running single queries. For INSERTs, I developed a method using ON DUPLICATE KEY UPDATE to INSERT and UPDATE multiple rows at once. The query looks something like this:
INSERT INTO table (
    key_field, f1, f2, f3
) VALUES
    (key1, f11, f21, f31),
    (key2, f12, f22, f32),
    ...
ON DUPLICATE KEY UPDATE
    f1 = IF(key_field=key1, f11, IF(key_field=key2, f12, IF(key_field=key3, f13, ...))),
    f2 = IF(key_field=key1, f21, IF(key_field=key2, f22, IF(key_field=key3, f23, ...))),
    f3 = IF(key_field=key1, f31, IF(key_field=key2, f32, IF(key_field=key3, f33, ...)))
As you can see, the query gets quite complicated as the number of rows grows, but you never write this query by hand. It's generated through code in your language of choice. The only things you have to worry about are keeping the total query size below the server's max_allowed_packet, and the fact that longer queries take longer to parse. I restrict it to about 100 rows per insert/update.
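Here's a sketch of how a query like the one above could be generated. Table and column names are placeholders, and a real version would escape or parameterise the values; this only shows the shape of the string-building, in particular the nested IF() chains:

```python
def build_bulk_upsert(table, key_field, fields, rows):
    """Build a multi-row INSERT ... ON DUPLICATE KEY UPDATE statement.

    rows: list of dicts, each containing key_field and every name in fields.
    Values are interpolated as-is here; do NOT use this unescaped in production.
    """
    cols = [key_field] + fields
    values = ", ".join(
        "(" + ", ".join(str(row[c]) for c in cols) + ")" for row in rows
    )
    updates = []
    for f in fields:
        # Build IF(key=..., val, IF(key=..., val, ...)) from the inside out;
        # fall back to the column's current value if no key matches.
        expr = f
        for row in reversed(rows):
            expr = "IF({k}={kv}, {v}, {e})".format(
                k=key_field, kv=row[key_field], v=row[f], e=expr)
        updates.append("{f} = {e}".format(f=f, e=expr))
    return ("INSERT INTO {t} ({c}) VALUES {v} ON DUPLICATE KEY UPDATE {u}"
            .format(t=table, c=", ".join(cols), v=values, u=", ".join(updates)))
```

For two rows keyed on `k` with one data field `f1`, this produces `INSERT INTO t (k, f1) VALUES (1, 10), (2, 20) ON DUPLICATE KEY UPDATE f1 = IF(k=1, 10, IF(k=2, 20, f1))`.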

Now, it's quite likely that I need to insert/update far more than 100 rows, which means the query parser would otherwise run for each batch. To get around this, I use a prepared statement with a heck of a lot of question marks in it. I'll leave it as an exercise for you to figure out what to pass to it. The real trick comes on the last batch. It's unlikely that I'll have an exact multiple of 100 records to insert, so the last batch may have fewer than 100 records. I have two choices at this point:
1. Create a new prepared statement with the number of records I need
2. Pad the current statement with extra copies of the last row
Neither method has shown any advantage over the other, so I prefer the former, since it sends less data over the wire at the cost of one extra query parse.
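The batching itself can be sketched as follows. The placeholder style and batch size are illustrative (MySQL drivers vary on `?` vs `%s`); the point is that every full batch reuses one prepared statement, and the short final batch (option 1 above) gets a freshly sized one:

```python
BATCH_SIZE = 100

def batches(rows, size=BATCH_SIZE):
    """Yield rows in fixed-size chunks; only the last chunk may be shorter."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def placeholders(n_rows, n_cols):
    """Placeholder block for an n_rows-row multi-row INSERT."""
    row = "(" + ", ".join(["?"] * n_cols) + ")"
    return ", ".join([row] * n_rows)
```

The caller prepares one statement with `placeholders(100, n_cols)`, reuses it for every full chunk, and prepares a second statement sized to `len(last_chunk)` only when the final chunk comes up short.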

Bulk selects are similar. If I'm going to operate on a bunch of records one at a time, it's faster to select them all at once, store them in an array, and operate on the array than to select them one at a time. This, of course, costs memory, and it is possible to use up all the RAM on the system doing something like this. It's happened several times. With experience you learn where to draw the line for your application.
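A minimal sketch of the bulk-select side, with `fetch_all` standing in for your driver's execute-and-fetch call (a hypothetical parameter, not a real API):

```python
def bulk_fetch(fetch_all, table, ids):
    """Fetch all rows for `ids` in a single round trip via WHERE id IN (...)."""
    if not ids:
        return []
    marks = ", ".join(["?"] * len(ids))
    sql = "SELECT * FROM {t} WHERE id IN ({m})".format(t=table, m=marks)
    return fetch_all(sql, ids)
```

The memory cost the text warns about lives in the caller: the whole result set is materialised at once, so the line to draw is on how large `ids` is allowed to grow per call.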

Now for the user queries, I did not optimise too much. I again went with partitioning by time, but this time by month. Our access patterns showed that most queries were for data from the last month, so partitioning the summary tables by month meant we only had to query one or two partitions at any time. The primary key was designed to either return the exact results the user wanted, or narrow the search down to a small set that could be filtered through a DB scan or in the application itself. Partitions ensured that in the worst case we'd have to do a full partition scan, not a full table scan.

This is by no means the fastest design. It is optimised to speed up the slowest part of the system, i.e., writes, but reads don't quite go out of the window as a result.

Monday, September 28, 2009

Scaling writes in MySQL

We use MySQL on most of our projects. One of these projects has an access pattern unlike any other I've worked on. Several million records a day need to be written to a table. These records are then read out once at the end of the day, summarised, and then very rarely touched again. Each record is about 104 bytes long (there's one VARCHAR column; everything else is fixed width), and that's after squeezing out every byte possible. The average number of records that we write in a day is 40 million, but this could go up.

A little bit about the set up. We have fairly powerful boxes with large disks using RAID1/0 and 16GB RAM, though at the time they only had 4GB. For BCP, we have a multi-master set up across two colos with statement-level replication. We used MySQL 5.1.

My initial tests with various parameters that affect writes showed that while MyISAM performed slightly better than InnoDB while the tables were small, it quickly deteriorated as the table size crossed a certain point. InnoDB performance deteriorated as well, but at a higher table size. The table size turned out to be related to the innodb_buffer_pool_size, and that in turn was capped by the amount of RAM we had on the system.

I decided to go with InnoDB since we also needed transactions for the summary tables, and I preferred not to divide my RAM between two different engines. I stripped out all indexes and retained only the primary key. Since InnoDB clusters the table on the primary key, I decided that rather than use an auto_increment column, I'd use a composite primary key over several columns to guarantee uniqueness. This had the added advantage that if the same record was inserted more than once, it would not result in duplicates. This small point was crucial for BCP, because it meant that we did not have to keep track of which records had already been inserted. If something crashed, we could just reinsert the last 30 minutes' worth of data, possibly into the secondary master, and not have any duplicates at the end of it. I used INSERT IGNORE to get this done automatically.
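A sketch of the replay statement this enables. The table and column names are made up for illustration, and the placeholder style depends on your driver; the point is that with a composite primary key, INSERT IGNORE makes re-running the same 30 minutes of data a no-op for rows that already landed:

```python
def replay_sql(table, n_rows):
    """INSERT IGNORE statement for n_rows rows; duplicate keys are skipped,
    so replaying a window of data after a crash cannot create duplicates."""
    row = "(?, ?, ?, ?)"  # hypothetical columns: time, source, metric, value
    return ("INSERT IGNORE INTO {t} (time, source, metric, value) VALUES {v}"
            .format(t=table, v=", ".join([row] * n_rows)))
```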

Now to get back to the table size limit that we were facing. Initial tests showed that we could insert at most 2100 records per second until the table size got to a little over the innodb_buffer_pool_size, at which point it degraded fairly rapidly to around 150 records per second. This was unacceptable, because records were coming into the system at an average rate of 1000 per second. Since we only needed to read these records at the end of the day, it was safe to accumulate them in a text file and periodically insert them in bulk. I decided to insert 40,000 records at a time. The number I chose was arbitrary, but later tests that I ran on batches of 10K, 20K and 80K showed no difference in insert rates. With batch inserts, we managed to get an insert rate of 10,000 records per second, but this also degraded to 150 records per second as soon as we hit the limit.

System stats on the database box showed that the disk was almost idle for most of the run and then suddenly shot up to 90-100% activity once we hit this limit, so it was obvious that at this point, the DB was exchanging data between buffers and disk all the time.

At this point, someone suggested that we try partitioning, which was available in MySQL 5.1. My first instinct was to partition based on the primary key so that we could read data out easily. However, reads weren't really our problem since we had no restriction on how fast they needed to be (at least not as much as writes). Instead, I decided to partition my table based on the pattern of incoming data.

The first part was obvious: use a separate table for each day's data. On a table of this size, DROP TABLE is much faster than DELETE FROM <table> WHERE ..., and it also reclaims the space. I should mention at this point that we used innodb_file_per_table as well, to make sure that each table had its own file rather than sharing the single InnoDB tablespace file.

Secondly, each table was partitioned on time. 12 partitions per day, 2 hours of data per partition. The MySQL docs for Partitioning were quite useful in understanding what to do. The command ended up looking like this:
CREATE TABLE (
...
) PARTITION BY RANGE( ( time DIV 3600 ) MOD 24 ) (
    PARTITION p0 VALUES LESS THAN (2),
    PARTITION p1 VALUES LESS THAN (4),
    PARTITION p2 VALUES LESS THAN (6),
    PARTITION p3 VALUES LESS THAN (8),
    PARTITION p4 VALUES LESS THAN (10),
    PARTITION p5 VALUES LESS THAN (12),
    PARTITION p6 VALUES LESS THAN (14),
    PARTITION p7 VALUES LESS THAN (16),
    PARTITION p8 VALUES LESS THAN (18),
    PARTITION p9 VALUES LESS THAN (20),
    PARTITION p10 VALUES LESS THAN (22),
    PARTITION p11 VALUES LESS THAN (24)
); 
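As a quick sanity check of the partition expression: (time DIV 3600) MOD 24 maps a UNIX timestamp to an hour of day, and the VALUES LESS THAN (2), (4), ... boundaries give each partition a 2-hour window. A one-function Python sketch of the same mapping:

```python
def partition_for(ts):
    """Return the partition number (0 for p0 .. 11 for p11) that a UNIX
    timestamp lands in under PARTITION BY RANGE((time DIV 3600) MOD 24)."""
    hour = (ts // 3600) % 24   # the partition expression
    return int(hour // 2)      # 2 hours per partition
```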
The time field is the timestamp of incoming records, and since time always moves forward (at least in my universe), this meant that I would never write to more than 2 partitions at any point in time. Now, a little back-of-the-envelope calculation:
44M x 102 bytes = approx 4.5GB
2x for InnoDB overhead = approx 9GB
+10% for partitioning overhead = approx 9.9GB
/12 partitions = approx 825MB per partition 
This turned out to be more or less correct. In most cases total table size ranges between 8-10GB, sometimes going up to 13GB. Partition sizes range from less than 700MB to over 1GB depending on the time of day. With 4GB of RAM, we had innodb_buffer_pool_size set at 2.7GB, which was good enough to store two partitions, but not good enough to work on any other tables or do anything else on the box. Boosting the RAM to 16GB meant that we could have a 12GB buffer pool and leave 4GB for the system. This was enough for 2 partitions, even if the total number of records went up, and we could work on other tables as well.

After partitioning, tests showed that we could sustain an insert rate of 10K rows per second for some time. As the table size grew past 10 million records, the insert rate dropped to about 8500 rows per second, but it stayed at that rate for well over 44 million records. I tested inserts up to 350 million records and we were able to sustain an insert rate of around 8500 rows per second. Coincidentally, during Michael Jackson's memorial service, we actually did hit an incoming rate of a little over 8000 records per second for a few hours.

One more BotE calculation:
8500 rows per second  x  86400 seconds per day = 734.4 Million records per day
Considering that before this system was redesigned it was handling about 7 Million records per day, I'd say that we did pretty well.

Update: If you want to see charts, they're in my ConFoo presentation on slideshare.
