A little bit about the set up. We have fairly powerful boxes with large disks using RAID1/0 and 16GB RAM, however at the time they only had 4GB. For BCP, we have a multi-master set up in two colos with statement level replication. We used MySQL 5.1.
My initial tests with various parameters that affect writes showed that while MyISAM performed slightly better than InnoDB while the tables were small, it quickly deteriorated as the table size crossed a certain point. InnoDB performance deteriorated as well, but at a higher table size. The table size turned out to be related to the
innodb_buffer_pool_size, and that in turn was capped by the amount of RAM we had on the system.
I decided to go with InnoDB since we also needed transactions for the summary tables and I preferred not to divide my RAM between two different engines. I stripped out all indexes, and retained only the primary key. Since InnoDB stores the table in the primary key, I decided that rather than use an auto_increment column, I'd cover several columns with the primary key to guarantee uniqueness. This had the added advantage that if the same record was inserted more than once, it would not result in duplicates. This small point was crucial for BCP, because it meant that we did not have to keep track of which records had already been inserted. If something crashed, we could just reinsert the last 30 minutes worth of data, possibly into the secondary master, and not have any duplicates at the end of it. I used
INSERT IGNOREto get this done automatically.
Now to get back to the table size limit that we were facing. Initial tests showed that we could insert at most 2100 records per second until the table size got to a little over the
innodb_buffer_pool_sizeand at that point it degraded fairly rapidly to around 150 records per second. This was unacceptable because records were coming in to the system at an average rate of 1000 per second. Since we only needed to read these records at the end of the day, it was safe to accumulate them into a text file and periodically insert them in bulk. I decided to insert 40,000 records at one time. The number I chose was arbitrary, but later tests that I ran on batches of 10K, 20K and 80K showed no difference in insert rates. With batch inserts, we managed to get an insert rate of 10,000 records per second, but this also degraded as soon as we hit the limit going down to 150 records per second.
System stats on the database box showed that the disk was almost idle for most of the run and then suddenly shot up to 90-100% activity once we hit this limit, so it was obvious that at this point, the DB was exchanging data between buffers and disk all the time.
At this point, someone suggested that we try partitioning, which was available in MySQL 5.1. My first instinct was to partition based on the primary key so that we could read data out easily. However, reads weren't really our problem since we had no restriction on how fast they needed to be (at least not as much as writes). Instead, I decided to partition my table based on the pattern of incoming data.
The first part was obvious, use a separate table for each day's data. On a table of this size,
DROP TABLEis much faster than
DELETE From <table> Where ..., and it also reclaims lost space. I should mention at this point that we used
file_per_tableas well to make sure that each table had its own file rather than use a single innodb file.
Secondly, each table was partitioned on time. 12 partitions per day, 2 hours of data per partition. The MySQL docs for Partitioning were quite useful in understanding what to do. The command ended up looking like this:
CREATE TABLE ( ... ) PARTITION BY RANGE( ( time DIV 3600 ) MOD 24 ) ( Partition p0 values less than (2), Partition p1 values less than (4), Partition p2 values less than (6), Partition p3 values less than (8), Partition p4 values less than (10), Partition p5 values less than (12), Partition p6 values less than (14), Partition p7 values less than (16), Partition p8 values less than (18), Partition p9 values less than (20), Partition p10 values less than (22), Partition p11 values less than (24) );The
timefield is the timestamp of incoming records, and since time always moves forward (at least in my universe), this meant that I would never write to more than 2 partitions at any point in time. Now, a little back of the envelope calculations:
44M x 102 bytes = approx 4.2GB 2x for InnoDB overhead = approx 8.4GB +10% for partitioning overhead = 9.2GB /12 partitions = approx 760MB per partitionThis turned out to be more or less correct. In most cases total table size ranges between 8-10GB, sometimes it goes up to 13GB. Partition sizes range from less than 700MB to over 1GB depending on the time of day. With 4GB of RAM, we had an innodb_buffer_pool set at 2.7GB, which was good enough to store two partitions, but not good enough to work on any other tables or do anything else on the box. Boosting the RAM to 16GB meant that we could have a 12GB buffer pool, and leave 4GB for the system. This was enough for 2 partitions, even if the total number of records went up, and we could work on other tables as well.
After partitioning, tests showed that we could sustain an insert rate of 10K rows per second for some time. As the table size grew past 10 million records, the insert rate dropped to about 8500 rows per second, but it stayed at that rate for well over 44 million records. I tested inserts up to 350 million records and we were able to sustain an insert rate of around 8500 rows per second. Coincidentally, during Michael Jackson's memorial service, we actually did hit an incoming rate of a little over 8000 records per second for a few hours.
One more BotE calculation:
8500 rows per second x 86400 seconds per day = 734.4 Million records per dayConsidering that before this system was redesigned it was handling about 7 Million records per day, I'd say that we did pretty well.
Update: If you want to see charts, they're in my ConFoo presentation on slideshare.