A little bit about the set up. We have fairly powerful boxes with large disks using RAID1/0 and 16GB RAM, however at the time they only had 4GB. For BCP, we have a multi-master set up in two colos with statement level replication. We used MySQL 5.1.
My initial tests with various parameters that affect writes showed that while MyISAM performed slightly better than InnoDB while the tables were small, it quickly deteriorated as the table size crossed a certain point. InnoDB performance deteriorated as well, but at a higher table size. The table size turned out to be related to the
innodb_buffer_pool_size, and that in turn was capped by the amount of RAM we had on the system.I decided to go with InnoDB since we also needed transactions for the summary tables and I preferred not to divide my RAM between two different engines. I stripped out all indexes, and retained only the primary key. Since InnoDB stores the table in the primary key, I decided that rather than use an auto_increment column, I'd cover several columns with the primary key to guarantee uniqueness. This had the added advantage that if the same record was inserted more than once, it would not result in duplicates. This small point was crucial for BCP, because it meant that we did not have to keep track of which records had already been inserted. If something crashed, we could just reinsert the last 30 minutes worth of data, possibly into the secondary master, and not have any duplicates at the end of it. I used
INSERT IGNORE to get this done automatically.Now to get back to the table size limit that we were facing. Initial tests showed that we could insert at most 2100 records per second until the table size got to a little over the
innodb_buffer_pool_size and at that point it degraded fairly rapidly to around 150 records per second. This was unacceptable because records were coming in to the system at an average rate of 1000 per second. Since we only needed to read these records at the end of the day, it was safe to accumulate them into a text file and periodically insert them in bulk. I decided to insert 40,000 records at one time. The number I chose was arbitrary, but later tests that I ran on batches of 10K, 20K and 80K showed no difference in insert rates. With batch inserts, we managed to get an insert rate of 10,000 records per second, but this also degraded as soon as we hit the limit going down to 150 records per second.System stats on the database box showed that the disk was almost idle for most of the run and then suddenly shot up to 90-100% activity once we hit this limit, so it was obvious that at this point, the DB was exchanging data between buffers and disk all the time.
At this point, someone suggested that we try partitioning, which was available in MySQL 5.1. My first instinct was to partition based on the primary key so that we could read data out easily. However, reads weren't really our problem since we had no restriction on how fast they needed to be (at least not as much as writes). Instead, I decided to partition my table based on the pattern of incoming data.
The first part was obvious, use a separate table for each day's data. On a table of this size,
DROP TABLE is much faster than DELETE From <table> Where ..., and it also reclaims lost space. I should mention at this point that we used file_per_table as well to make sure that each table had its own file rather than use a single innodb file.Secondly, each table was partitioned on time. 12 partitions per day, 2 hours of data per partition. The MySQL docs for Partitioning were quite useful in understanding what to do. The command ended up looking like this:
CREATE TABLE (The
...
) PARTITION BY RANGE( ( time DIV 3600 ) MOD 24 ) (
Partition p0 values less than (2),
Partition p1 values less than (4),
Partition p2 values less than (6),
Partition p3 values less than (8),
Partition p4 values less than (10),
Partition p5 values less than (12),
Partition p6 values less than (14),
Partition p7 values less than (16),
Partition p8 values less than (18),
Partition p9 values less than (20),
Partition p10 values less than (22),
Partition p11 values less than (24)
);
time field is the timestamp of incoming records, and since time always moves forward (at least in my universe), this meant that I would never write to more than 2 partitions at any point in time. Now, a little back of the envelope calculations:44M x 102 bytes = approx 4.2GBThis turned out to be more or less correct. In most cases total table size ranges between 8-10GB, sometimes it goes up to 13GB. Partition sizes range from less than 700MB to over 1GB depending on the time of day. With 4GB of RAM, we had an innodb_buffer_pool set at 2.7GB, which was good enough to store two partitions, but not good enough to work on any other tables or do anything else on the box. Boosting the RAM to 16GB meant that we could have a 12GB buffer pool, and leave 4GB for the system. This was enough for 2 partitions, even if the total number of records went up, and we could work on other tables as well.
2x for InnoDB overhead = approx 8.4GB
+10% for partitioning overhead = 9.2GB
/12 partitions = approx 760MB per partition
After partitioning, tests showed that we could sustain an insert rate of 10K rows per second for some time. As the table size grew past 10 million records, the insert rate dropped to about 8500 rows per second, but it stayed at that rate for well over 44 million records. I tested inserts up to 350 million records and we were able to sustain an insert rate of around 8500 rows per second. Coincidentally, during Michael Jackson's memorial service, we actually did hit an incoming rate of a little over 8000 records per second for a few hours.
One more BotE calculation:
8500 rows per second x 86400 seconds per day = 734.4 Million records per dayConsidering that before this system was redesigned it was handling about 7 Million records per day, I'd say that we did pretty well.

4 comments:
Thank you for sharing! I wonder if you could share a sanitized version of the schema you were inserting into?
There's a benchmark that exists to test a storage engine's sustained insert performance called iibench. Results for InnoDB are available here (spoiler: it ain't all pretty):
http://mysqlha.blogspot.com/2008/12/innodb-insert-performance.html
The random IO of updating B-tree indexes has got to be one of the big factors for the slow down once you exceed buffer pool size, but I wonder if there's anything else to it that could potentially be improved. I know I've seen cases where table scans can just wipe the contents of the buffer pool because InnoDB doesn't have any mid-point insertion strategies to avoid this.
You might want to give the TokuDB storage engine a try since it is a perfect fit for your application. Take a look at the first graph in:
http://tokutek.com/mysql-performance-brief.pdf
Using the iiBench benchmark mentioned by Morgan, TokuDB sustained an insert rate of more than 16,000 rows per second with three fully random indexes on a 1B row table without any partitioning.
Contact me if you are interested at hotchkiss at tokutek dot com
You write: "I should mention at this point that we used file_per_table as well to make sure that each table had its own file rather than use a single innodb file"
Since you don't insert records in different tables at the same time, what was the main reasons which lead you to this decision?
Thanks
file_per_table is always a good idea. It allows you to move some tables off to a different file system. Also, if you drop a table (we drop tables older than 2 months), the space is reclaimed immediately instead of fragmenting the main innodb tablespace.
Lastly, if I need to clone database, I can cheat a little and only copy over tables that are critical to day to day system operation, and then copy over older tables when the system is idle.
Post a Comment